Abstract

Modern data acquisition routinely produces massive amounts of network data.
Though many methods and models have been proposed to analyze such data, the
research of network data is largely disconnected from the classical theory of statistical
learning and signal processing. In this paper, we present a new framework for modeling
network data, which connects two seemingly different areas: network data analysis
and compressed sensing. From a nonparametric perspective, we model an observed
network using a large dictionary. In particular, we consider the network clique detection
problem and show connections between our formulation and a new algebraic tool,
namely Radon basis pursuit in homogeneous spaces. Such a connection allows us to
identify rigorous recovery conditions for clique detection problems. Though this paper
is mainly conceptual, we also develop practical approximation algorithms for solving
empirical problems and demonstrate their usefulness on real-world datasets.

I. Introduction

In the past decade, research on network data has increased dramatically. Examples
include scientific studies involving web data or hypertext documents connected via hyperlinks,
social networks or user profiles connected via friend links, co-authorship and citation
networks connected by collaboration or citation relationships, gene or protein networks
connected by regulatory relationships, and many more. Such data appear frequently in modern
application domains and have led to numerous high-impact applications. For instance, detecting
anomalies in ad hoc information networks is vital for corporate and government security;
exploring hidden community structures helps us to better conduct online advertising and
marketing; inferring large-scale gene regulatory networks is crucial for new drug design and
disease control. Due to the increasing importance of network data, principled analytical and
modeling tools are crucially needed.

Towards this goal, researchers from the network modeling community have proposed many 
models to explore and predict the network data. These models roughly fall into two cate- 
gories: static and dynamic models. For the static model, there is only one single snapshot 
of the network being observed. In contrast, dynamic models can be applied to analyze 
datasets that contain many snapshots of the network indexed by different time points.
Examples of the static network models include the Erdős-Rényi-Gilbert random graph model
(Erdős and Rényi, 1959, 1960), the p1 (Holland and Leinhardt, 1981), p2 (Duijn et al., 2004)
and more general exponential random graph (or p*) model (Wasserman and Pattison, 1996),
latent space model (Hoff et al., 2001), block model (Lorrain and White, 1971), stochastic
blockmodel (Wasserman and Anderson, 1987), and mixed membership stochastic blockmodel
(Airoldi et al., 2008). Examples of the dynamic network models include the preferential
attachment model (Barabási and Albert, 1999), the small-world model (Watts and Strogatz,
1998), duplication-attachment model (Kleinberg et al., 1999; Kumar et al., 2000), continuous
time Markov model (Snijders, 2005), and dynamic latent space model (Sarkar and Moore,
2005). A comprehensive review of these models is provided in Goldenberg et al. (2010).



Though many methods and models have been proposed, the research of network data analysis
is largely disconnected from the classical theory of statistical learning and signal processing.
The main reason is that, unlike the usual scientific data for which independent measurements
can be repeatedly collected, network data are in general collected in one single realization
and the nodes within the network are highly relational due to the existence of many linkages.
Such a disconnection prevents us from directly exploiting state-of-the-art statistical
learning methods and theory to analyze network data. To bridge this gap, we present a novel
framework to model network data. Our framework assumes that the observed network has a
sparse representation with respect to some dictionary (or basis space). Once the dictionary
is given, we formulate the network modeling problem as a compressed sensing problem.
Compressed sensing, also known as compressive sensing and compressive sampling, is a
technique for finding sparse solutions to underdetermined linear systems. In statistical machine
learning, it is related to reconstructing a signal which has a sparse representation in a large
dictionary. The field of compressed sensing has existed for decades, but recently it has
exploded due to the important contributions of Candès and Tao (2005, 2007); Candès (2008);
Tsaig and Donoho (2006). By viewing the observed network adjacency matrix as the output
of an underlying function evaluated on a discrete domain of network nodes, we can formulate
the network modeling problem as a compressed sensing problem.



Specifically, we consider the network clique detection problem within this novel framework.
By considering a generative model in which the observed adjacency matrix is assumed
to have a sparse representation in a large dictionary where each basis corresponds to a
clique, we connect our framework with a new algebraic tool, namely Radon basis pursuit
in homogeneous spaces. Our problem can be regarded as an extension of the work of
Jagabathula and Shah (2008), which studies sparse recovery of functions on permutation
groups, while we reconstruct functions on k-sets (cliques), often called the homogeneous
space associated with a permutation group in the literature (Diaconis, 1988). It turns out
that the discrete Radon basis becomes the natural choice instead of the Fourier basis
considered in Jagabathula and Shah (2008). This leaves us a new challenge in addressing
noiseless exact recovery and stable recovery with noise. Unfortunately, the greedy algorithm
for exact recovery in Jagabathula and Shah (2008) cannot be applied to noisy settings, and
in general the Radon basis does not satisfy the Restricted Isometry Property (RIP) (Candès,
2008), which is crucial for universal recovery. In this paper, we develop new theories and
algorithms which guarantee exact, sparse, and stable recovery under the choice of the Radon
basis. These theories have deep roots in Basis Pursuit (Chen et al., 1999) and its extensions
with uniformly bounded noise. Though this paper is mainly conceptual, showing the
connection between network modeling and compressed sensing, we also provide some rigorous
theoretical analysis and practical algorithms on the clique recovery problem to illustrate the
usefulness of our framework.

The main content of this paper can be summarized as follows. Section 2 presents the general
framework of compressive network analysis. In Sections 3, 4 and 5 we consider the clique
detection problem under the compressive network analysis framework. A polynomial time
approximation algorithm is provided in Section 6 for the clique detection problem. We also
demonstrate successful application examples in Section 7. Section 8 concludes the paper.



II. Main Idea 

In this section we present the general framework of compressive network analysis with a 
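nonparametric view. We start with an introduction of notation: let $u = (u_1, \ldots, u_d)^T \in \mathbb{R}^d$
be a vector and $I(\cdot)$ be the indicator function. We denote

$$ \|u\|_0 = \sum_{j=1}^{d} I(u_j \neq 0), \quad \|u\|_1 = \sum_{j=1}^{d} |u_j|, \quad \|u\|_2^2 = \sum_{j=1}^{d} u_j^2, \quad \|u\|_\infty = \max_j |u_j|. \tag{2.1} $$

We also denote by $\langle \cdot, \cdot \rangle$ the Euclidean inner product and
$\mathrm{sign}(u) = (\mathrm{sign}(u_1), \ldots, \mathrm{sign}(u_d))^T$, where

$$ \mathrm{sign}(u_j) = \begin{cases} +1 & u_j > 0 \\ 0 & u_j = 0 \\ -1 & u_j < 0. \end{cases} \tag{2.2} $$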



We represent a network as a graph $G = (V, E)$, where $V = \{1, \ldots, n\}$ is the set of nodes and
$E \subseteq V \times V$ is the set of edges. Let $B \in \mathbb{R}^{n \times n}$ be the adjacency matrix of the observed
network, with $B_{ij}$ representing a quantity associated with nodes $i$ and $j$. Without loss of
generality, we assume that $B$ is symmetric, $B = B^T$, and $\mathrm{diag}(B) = 0$. With these assumptions,
to model $B$ we only need to model its upper triangle. For notational simplicity, we squeeze $B$
into a vector $b \in \mathbb{R}^M$ where $M = n(n-1)/2$ is the number of upper-triangle elements in $B$. Let
$f(V) \in \mathbb{R}^M$ be an unknown vector-valued function defined on $V$. We assume a generative
model of the observed adjacency matrix $B$ (or equivalently, $b$):

$$ b = f(V) + z, \tag{2.3} $$

where $z \in \mathbb{R}^M$ is a noise vector. We can view $f(V)$ as evaluating a possibly
infinite-dimensional function $f$ on a discrete set $V$; thus the model (2.3) is intrinsically
nonparametric and can model any static network.
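
The vectorization of $B$ into $b$ can be sketched in a few lines of Python. This is only an illustration: the function name `vectorize_upper_triangle`, the row-major ordering of node pairs, and the toy matrix are assumptions, since the text does not fix a particular ordering of the $M$ pairs.

```python
import numpy as np

def vectorize_upper_triangle(B):
    """Stack the strictly upper-triangular entries of a symmetric matrix B into a vector b."""
    B = np.asarray(B, dtype=float)
    n = B.shape[0]
    if B.shape != (n, n) or not np.allclose(B, B.T):
        raise ValueError("B must be a square symmetric matrix")
    # Index pairs (i, j) with i < j, listed in row-major order (an assumed convention).
    rows, cols = np.triu_indices(n, k=1)
    pairs = list(zip(rows.tolist(), cols.tolist()))
    return B[rows, cols], pairs

# Toy symmetric adjacency matrix with zero diagonal (made-up weights).
B = np.array([[0, 2, 1, 0],
              [2, 0, 3, 0],
              [1, 3, 0, 5],
              [0, 0, 5, 0]])
b, pairs = vectorize_upper_triangle(B)
print(len(b), pairs)  # M = n(n-1)/2 entries, one per node pair
```

Any fixed ordering of the pairs works, as long as the same ordering is used for the rows of the dictionary introduced next.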

Without further regularity conditions or constraints, there is no hope for us to reliably
estimate $f$. In our framework, we assume that $f$ has a sparse representation with respect to
an $M$ by $N$ dictionary $A = [\phi_1(V), \ldots, \phi_N(V)]$ where each $\phi_j(V) \in \mathbb{R}^M$ is a basis function,
i.e., there exists a subset $S \subset \{1, \ldots, N\}$ with cardinality $|S| \ll N$, such that

$$ f(V) = \sum_{q \in S} x_q \phi_q(V). \tag{2.4} $$

In the sequel, we denote by $A_{pq}$ the element on the $p$-th row and $q$-th column of $A$. Here $p$
indexes a pair of different nodes and $q$ indexes a basis $\phi_q(V)$. To estimate $f$, we only need
to reconstruct $x = (x_1, \ldots, x_N)^T$. Given the dictionary $A$, we can estimate $f$ by solving the
following program:

$$ (P_0) \quad \min \|x\|_0 \quad \text{s.t.} \quad \|b - Ax\|_z \le \delta, \tag{2.5} $$

where $\|\cdot\|_z$ is a vector norm constructed using the knowledge of $z$. The problem in (2.5) is
non-convex. In the sparse learning literature, a convex relaxation of (2.5) can be written as

$$ (P_1) \quad \min \|x\|_1 \quad \text{s.t.} \quad \|b - Ax\|_z \le \delta. \tag{2.6} $$

One thing to note is that the dictionary $A$ can be either constructed based on domain
knowledge, or learned from empirical data. For simplicity, we always assume $A$ is
pre-given in this paper. In the following sections, we use the clique detection problem as a
case study to illustrate the usefulness of this framework.



III. Clique Detection 

In network data analysis, the problem of identifying communities or cliques¹ based on partial
information arises frequently in many applications, including identity management (Guibas,
2008), statistical ranking (Diaconis, 1988; Jagabathula and Shah, 2008), and social
networks (Leskovec et al., 2010). In these applications we are typically given a network with its
nodes representing players, items, or characters, and edge weights summarizing the observed
pairwise interactions. The basic problem is to determine communities or cliques within the
network by observing the frequencies of low order interactions, since in reality such low order
interactions are often governed by a considerably smaller number of high order communities
or cliques. Therefore the clique detection problem can be formulated as compressed sensing
of cliques in large networks. To solve this problem, one has to answer two questions: (i)
what is a suitable representation basis, and (ii) what is the reconstruction method? Before
rigorously formulating the problem, we provide three motivating examples as a glimpse of
typical situations which can be addressed within the framework of this paper.

Example 1 (Tracking Team Identities) We consider the scenario of multiple targets 
moving in an environment monitored by sensors. We assume every moving target has an 
identity and they each belong to some teams or groups. However, we can only obtain 
partial interaction information due to the measurement structure. For example, when watching a
grey-scale video of a basketball game (where it may be hard to tell the two teams apart),
sensors may observe ball passes or collaborative offensive/defensive interactions between
teammates. The observations are partial because players mostly exhibit low order
interactions to the sensors in basketball games. It is difficult to observe a single event
which involves all team members. Our objective is to infer membership information (which
team the players belong to) from such partially observed interactions. 

Example 2 (Inferring High Order Partial Rankings) The problem of clique identifi- 
cation also arises in ranking problems. Consider a collection of items which are to be ranked 
by a set of users. Each user can propose the set of his or her j most favorite items (say top 
3 items) but without specifying a relative preference within this set. We then wish to infer 
what are the top k > j most favorite items (say top 5 items). This problem requires us to 
infer high order partial rankings from low order observations. 

Example 3 (Detecting Communities in Social Networks) Detecting communities in
social networks is of extraordinary importance. It can be used to understand the organization
or collaboration structure of a social network. However, we do not have direct mechanisms
to sense social communities. Instead, we have partial, low order interaction information. For
example, we observe pairwise or triple-wise co-appearance among people who hang out for
some leisure activities together. We hope to detect those social communities in the network
from such partially observed data.

¹ Here a clique means a complete subgraph of the network.

In these examples we are typically given a network with some nodes representing players,
items, or characters, and edge weights summarizing the observed pairwise interactions.
Triple-wise and other low order information can be further exploited if we consider complete
sub-graphs or cliques in the networks. The basic problem is to determine common interest
groups or cliques within the network by observing the frequency of low order interactions,
since in reality such low order interactions are often governed by a considerably smaller
number of high order communities. In this sense we shall formulate our problem as compressed
sensing of cliques in networks.

The problem we are going to address has a close relationship with community detection in
social networks. Community structures are ubiquitous in social networks. However, there is
no consistent definition of a "community". In the majority of research studies, community
detection is based on partitions of nodes in a network. Among these works, the most famous
one is based on the modularity (Newman, 2006) of a partition of the nodes in a group. A
shortcoming of partition-based methods is that they do not allow overlapping communities,
which occur frequently in practice. Recently there has been growing interest in studying
overlapping community structures (Lancichinetti and Fortunato, 2009). The relevance of
cliques to overlapping communities was probably first addressed in the clique percolation
method (Palla et al., 2005). In that work, communities were modeled as maximal connected
components of cliques in a graph, where two k-cliques are said to be connected if they share
k - 1 nodes. In this paper, we pursue a compressive representation of signals or functions
on networks based on clique information, which in turn sheds light on multiple aspects of
community structure.



In this paper, we use the same definition as in Palla et al. (2005) but are more interested
in identifying cliques. We pursue an alternative approach to exploring networks based on
clique information which potentially sheds light on multiple aspects of community structures.
Roughly speaking, we assume that there is a frequency function defined on complete low
order subsets. For example, in some social networks edge weights are bivariate functions
defined on pairs of nodes reflecting the strength of pairwise interactions. We also assume that
there is another latent frequency function defined on complete high order subsets which we
hope to infer. Intuitively, the interaction frequency of a particular low order subset should
be the sum of the frequencies of the high order subsets which it belongs to. Hence we consider
a generative mechanism in which there exists a linear mapping from frequencies on high
order subsets (usually sparsely distributed) to low order subsets. One typically can collect
data on low order subsets while the task is to find those few dominant high order subsets.
This problem naturally fits into the general compressive network analysis framework we
introduced in the previous section. Below we demonstrate that the Radon basis is
an appropriate representation for our purpose, which allows sparse recovery by a simple
linear programming reconstruction approach.
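
The generative mechanism described above (a low order frequency is the sum of the frequencies of the high order subsets containing it) can be illustrated with a small Python sketch. The function name `pairwise_frequencies` and the two toy communities with their weights are hypothetical and chosen only for illustration.

```python
from itertools import combinations

def pairwise_frequencies(clique_weights):
    """For every node pair, sum the weights of all cliques that contain the pair."""
    freq = {}
    for clique, weight in clique_weights.items():
        for pair in combinations(sorted(clique), 2):
            freq[pair] = freq.get(pair, 0.0) + weight
    return freq

# Hypothetical latent communities: a 4-clique with weight 2.0 and a 3-clique with weight 1.0.
clique_weights = {frozenset({0, 1, 2, 3}): 2.0, frozenset({2, 3, 4}): 1.0}
freq = pairwise_frequencies(clique_weights)
for pair in sorted(freq):
    print(pair, freq[pair])
```

The pair (2, 3) belongs to both communities, so its observed frequency is the sum of the two latent weights; pairs that lie in no community are simply absent from the dictionary.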



IV. Radon Basis Pursuit 



A. Mathematical Formulation 

Under the general framework in (2.3), we formulate the clique detection problem as a
compressed sensing problem named Radon Basis Pursuit. For this, we construct a dictionary
$A$ so that each column of $A$ corresponds to one clique. The intuition behind such a construction
is that we assume there are several hidden cliques within the network, which are perhaps
of different sizes and may have overlaps. Every clique has a certain weight. The observed
adjacency matrix $B$ (or equivalently, its vectorized version $b$) is a linear combination of many
clique bases contaminated by a noise vector $z$.

For simplicity, we first restrict ourselves to the case that all the cliques are of the same size
$k < n$. The case with mixed sizes will be discussed later. Let $C_1, C_2, \ldots, C_N$ be all the
cliques of size $k$, with each $C_q \subset V$. We have $N = \binom{n}{k}$. For each $q \in \{1, \ldots, N\}$, we construct
the dictionary $A$ as follows:

$$ A_{pq} = \begin{cases} 1 & \text{if the $p$-th pair of nodes both lie in } C_q, \\ 0 & \text{otherwise.} \end{cases} $$



The matrix $A$ constructed above is related to discrete Radon transforms. In fact, up to a
constant and column scaling, the transpose matrix $A^*$ is called the discrete Radon transform
for two suitably defined homogeneous spaces (Diaconis, 1988). Our usage here is to exploit
the transpose matrix of the Radon transform to construct an over-complete dictionary, so
that the observed output $b$ has a sparse representation with respect to it. A more technical
discussion of Radon transforms is beyond the scope of this paper.

The above formulation can be generalized to the case where $b$ is a vector of length $\binom{n}{j}$
($j \ge 2$) with the $p$-th entry in $b$ characterizing a quantity associated with a $j$-set (a set with
cardinality $j$). The dictionary $A$ will then be a binary matrix $R^{j,k}$ with entries indicating
whether a $j$-set is a subset of a $k$-clique (a clique with $k$ nodes), i.e.,

$$ R^{j,k}_{pq} = \begin{cases} 1 & \text{if the $p$-th $j$-set of nodes all lie in the $k$-clique } C_q, \\ 0 & \text{otherwise.} \end{cases} $$

Therefore, the case where $b$ is a vector of length $\binom{n}{2}$ corresponds to the special case where
$A = R^{2,k}$. Our algorithms and theory hold for general $R^{j,k}$ with $j < k$.
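
The construction of $R^{j,k}$ can be sketched as follows. This is a minimal illustration, assuming lexicographic ordering of the $j$-sets (rows) and $k$-sets (columns) and zero-based node labels; the function name `radon_matrix` is not from the paper.

```python
from itertools import combinations
import numpy as np

def radon_matrix(n, j, k):
    """Binary matrix whose (p, q) entry is 1 when the p-th j-set lies inside the q-th k-set."""
    if not (1 <= j < k <= n):
        raise ValueError("need 1 <= j < k <= n")
    j_sets = list(combinations(range(n), j))   # rows, lexicographic order (assumed)
    k_sets = list(combinations(range(n), k))   # columns, lexicographic order (assumed)
    row_of = {s: p for p, s in enumerate(j_sets)}
    R = np.zeros((len(j_sets), len(k_sets)))
    for q, clique in enumerate(k_sets):
        for sub in combinations(clique, j):    # every j-subset of the clique
            R[row_of[sub], q] = 1.0
    return R, j_sets, k_sets

# Special case j = 2: rows are node pairs, columns are triangles on 5 nodes.
R, j_sets, k_sets = radon_matrix(n=5, j=2, k=3)
print(R.shape, R.sum(axis=0))
```

Each column contains exactly $\binom{k}{j}$ ones, one for every $j$-subset of the corresponding clique. The matrix has $\binom{n}{k}$ columns, so this explicit construction is only practical for small $n$ and $k$.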

Now we provide two concrete reconstruction programs for the clique identification problems: 

$$ (\mathcal{P}_1) \quad \min \|x\|_1 \quad \text{s.t.} \quad b = Ax $$
$$ (\mathcal{P}_{1,\delta}) \quad \min \|x\|_1 \quad \text{s.t.} \quad \|Ax - b\|_\infty \le \delta. $$

$\mathcal{P}_1$ is known as Basis Pursuit (Chen et al., 1999), where we consider the ideal case that the
noise level is zero. For robust reconstruction against noise, we consider the relaxed program
$\mathcal{P}_{1,\delta}$. The program $\mathcal{P}_{1,\delta}$ differs from the Dantzig selector (Candès and Tao, 2007), which
uses a constraint of the form $\|A^*(Ax - b)\|_\infty \le \delta$. The reason for our choice of $\mathcal{P}_{1,\delta}$ lies
in the fact that a more natural noise model for network data is bounded noise rather than
Gaussian noise. Moreover, our linear programming formulation of $\mathcal{P}_{1,\delta}$ enables practical
computation for large scale problems.
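
Both programs can be written as standard linear programs. The sketch below is one possible formulation, not necessarily the one used by the authors: it splits $x = u - v$ with $u, v \ge 0$ and calls SciPy's `linprog` with the HiGHS solver. The helper names `clique_dictionary` and `l1_recover`, the toy instance (two disjoint triangles among 6 nodes), and the noise level are assumptions made for illustration.

```python
from itertools import combinations
import numpy as np
from scipy.optimize import linprog

def clique_dictionary(n, k):
    """Pair-versus-clique incidence matrix A (rows: node pairs, columns: k-cliques)."""
    pairs = list(combinations(range(n), 2))
    cliques = list(combinations(range(n), k))
    row_of = {p: i for i, p in enumerate(pairs)}
    A = np.zeros((len(pairs), len(cliques)))
    for q, c in enumerate(cliques):
        for p in combinations(c, 2):
            A[row_of[p], q] = 1.0
    return A, cliques

def l1_recover(A, b, delta=0.0):
    """Solve min ||x||_1 s.t. ||Ax - b||_inf <= delta using the split x = u - v, u, v >= 0."""
    M, N = A.shape
    cost = np.ones(2 * N)  # sum(u) + sum(v) equals ||x||_1 at an optimum
    if delta == 0:
        # Noiseless case: equality constraint A(u - v) = b.
        res = linprog(cost, A_eq=np.hstack([A, -A]), b_eq=b,
                      bounds=(0, None), method="highs")
    else:
        # Bounded noise: A(u - v) - b <= delta and b - A(u - v) <= delta.
        A_ub = np.block([[A, -A], [-A, A]])
        b_ub = np.concatenate([b + delta, delta - b])
        res = linprog(cost, A_ub=A_ub, b_ub=b_ub,
                      bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:N] - res.x[N:]

# Toy instance (assumed): two disjoint triangles hidden among 6 nodes.
A, cliques = clique_dictionary(n=6, k=3)
x0 = np.zeros(len(cliques))
x0[cliques.index((0, 1, 2))] = 2.0
x0[cliques.index((3, 4, 5))] = 1.0

x_exact = l1_recover(A, A @ x0)                      # noiseless program, delta = 0

rng = np.random.default_rng(0)
eps = 0.05
b_noisy = A @ x0 + rng.uniform(-eps, eps, size=A.shape[0])
x_noisy = l1_recover(A, b_noisy, delta=eps)          # relaxed program, delta = eps
print(np.round(x_exact, 6), np.round(x_noisy, 3))
```

The split into nonnegative parts doubles the number of variables but keeps the problem in a form that any LP solver accepts. For realistic network sizes the number of columns $\binom{n}{k}$ makes this direct formulation too large to enumerate.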



B. Intuition 



Let $G = (V, E)$ be the network we are trying to model. The set of vertices $V$ represents
individual identities such as people in a social network. Each edge in $E$ is associated with
a weight which represents interaction frequency information.

We assume that there are several common interest groups or communities within the net- 
work, represented by cliques (or complete sub-graphs) within graph G, which are perhaps of 
different sizes and may have overlaps. Every community has certain interaction frequency 
which can be viewed as a function on cliques. However, we only receive partial measure- 
ments consisting of low order interaction frequency on subsets in a clique. For example, in 
the simplest case we may only observe pairwise interactions represented by edge weights. 
Our problem is to reconstruct the function on cliques from such partially observed data. 
A graphical illustration of this idea is provided in Figure 1, in which we see that an observed
network can be written as a linear combination of several overlapping cliques.

Figure 1: An illustrative example of the main idea.

One application scenario is to identify two basketball teams from pairwise interactions among
players. Suppose we have $x_0$, which is a signal on all 5-sets of a 10-player set. We assume
it is sparsely concentrated on two 5-sets which correspond to the two teams with nonzero
weights. Assume we have observations $b$ of pairwise interactions $b = Ax_0 + z$, where $z$ is
uniform random noise defined on $[-\epsilon, \epsilon]$. We solve $\mathcal{P}_{1,\delta}$ with $\delta = \epsilon$, which is a linear program
over $x \in \mathbb{R}^{\binom{10}{5}} = \mathbb{R}^{252}$ with parameters $A \in \mathbb{R}^{\binom{10}{2} \times \binom{10}{5}} = \mathbb{R}^{45 \times 252}$ and $b \in \mathbb{R}^{45}$.
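
To make the dimensions of this scenario concrete, the following sketch builds the $45 \times 252$ dictionary and simulates noisy pairwise observations. The team weights (3.0 and 2.0), the noise level, and the random seed are made up for illustration; only the sizes and the noise model come from the text.

```python
from itertools import combinations
from math import comb
import numpy as np

n, j, k = 10, 2, 5
pairs = list(combinations(range(n), j))   # observed pairs of players
teams = list(combinations(range(n), k))   # candidate 5-player teams
row_of = {p: i for i, p in enumerate(pairs)}

A = np.zeros((len(pairs), len(teams)))
for q, team in enumerate(teams):
    for p in combinations(team, j):
        A[row_of[p], q] = 1.0

# x0 is supported on two complementary teams (weights are hypothetical).
x0 = np.zeros(len(teams))
x0[teams.index((0, 1, 2, 3, 4))] = 3.0
x0[teams.index((5, 6, 7, 8, 9))] = 2.0

eps = 0.1
rng = np.random.default_rng(1)
z = rng.uniform(-eps, eps, size=len(pairs))  # bounded noise on [-eps, eps]
b = A @ x0 + z

# With delta = eps, the true signal satisfies the constraint of the relaxed program.
x0_is_feasible = bool(np.max(np.abs(A @ x0 - b)) <= eps)
print(A.shape, (comb(n, j), comb(n, k)), x0_is_feasible)
```

Because the noise is bounded by $\epsilon$ in every coordinate, choosing $\delta = \epsilon$ guarantees that $x_0$ is feasible, so the minimizer of the relaxed program has $\ell_1$ norm no larger than that of $x_0$.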

C. Connection with Radon Basis 

Let $V_j$ denote the set of all $j$-sets of $V = \{1, \ldots, n\}$ and $M^j$ be the set of real-valued
functions on $V_j$. The observed interaction frequencies $b$ on all $j$-sets can be viewed as a
function in $M^j$. We build a matrix $\tilde{R}^{j,k} : M^k \to M^j$ ($j < k$) as a mapping from functions
on all $k$-sets of $V$ to functions on all $j$-sets of $V$. In this setup, each row represents a $j$-set
and each column represents a $k$-set. The entries of $\tilde{R}^{j,k}$ are either 0 or 1, indicating whether
the $j$-set is a subset of the $k$-set. Note that every column of $\tilde{R}^{j,k}$ has $\binom{k}{j}$ ones. Lacking a
priori information, we assume that every $j$-set of a particular $k$-set has equal interaction
probability, whence we choose the same constant 1 for each column. We further normalize $\tilde{R}^{j,k}$
to $R^{j,k}$ so that the $\ell_2$ norm of each column of $R^{j,k}$ is 1. To summarize, we have

$$ R^{j,k}(\sigma, \tau) = \begin{cases} 1 / \sqrt{\binom{k}{j}} & \text{if } \sigma \subset \tau; \\ 0 & \text{otherwise,} \end{cases} $$

where $\sigma$ is a $j$-set and $\tau$ is a $k$-set. As we will see, this construction leads to a canonical basis
associated with the discrete Radon transform. The size of the matrix $R^{j,k}$ clearly depends on
the total number of items $n = |V|$. We omit $n$ as its meaning will be clear from the context.

The matrix $R^{j,k}$ constructed above is related to discrete Radon transforms on the homogeneous
space $M^k$. In fact, up to a constant, the adjoint operator $(R^{j,k})^*$ is called the discrete Radon
transform from homogeneous space $M^j$ to $M^k$ in Diaconis (1988). Here all the $k$-sets form
a homogeneous space. The collection of all row vectors of $R^{j,k}$ is called the $j$-th Radon
basis for $M^k$. Our usage here is to exploit the transpose matrix of the Radon transform to
construct an over-complete dictionary for $M^j$, so that the observation $b$ can be represented
by a possibly sparse function $x \in M^k$ ($k \ge j$).



The Radon basis was proposed as an efficient way to study partially ranked data in Diaconis
(1988), where it was shown that by looking at low order Radon coefficients of a function on
$M^k$, we usually get useful and interpretable information. The approach here adds a reversal
of this perspective, i.e., the reconstruction of sparse high order functions from low order
Radon coefficients. We will discuss this in the sequel with a connection to compressive
sensing (Chen et al., 1999; Candès and Tao, 2005).



V. Mathematical Theory 



One advantage of our new framework on compressive network analysis is that it enables 
rigorous theoretical analysis of the corresponding convex programs. 



A. Failure of Universal Recovery 



Recently it was shown by Candès and Tao (2005) and Candès (2008) that $\mathcal{P}_1$ has a unique
sparse solution $x_0$ if the matrix $A$ satisfies the Restricted Isometry Property (RIP), i.e.,
for every subset of columns $T \subset \{1, \ldots, N\}$ with $|T| \le s$, there exists a certain universal
constant $\delta_s \in [0, \sqrt{2} - 1)$ such that

$$ (1 - \delta_s)\|x\|_2^2 \le \|A_T x\|_2^2 \le (1 + \delta_s)\|x\|_2^2, \quad \forall x \in \mathbb{R}^{|T|}, $$

where $A_T$ is the sub-matrix of $A$ with columns indexed by $T$. Then exact recovery holds
for all $s$-sparse signals $x_0$ (i.e., $x_0$ has at most $s$ non-zero components), whence it is called
universal recovery.

Unfortunately, in our construction of the basis matrix $A$, RIP is not satisfied except for very
small $s$. The following theorem illustrates the failure of universal recovery in our case.



Theorem 5.1. Let $n > k + j + 1$ and $A = R^{j,k}$ with $j < k$. Unless $s < \binom{k+j+1}{k}$, there does
not exist a $\delta_s < 1$ such that the inequalities

$$ (1 - \delta_s)\|x\|_2^2 \le \|A_T x\|_2^2 \le (1 + \delta_s)\|x\|_2^2, \quad \forall x \in \mathbb{R}^{|T|} $$

hold universally for every $T \subset \{1, \ldots, N\}$ with $|T| \le s$, where $N = \binom{n}{k}$.

Note that $\binom{k+j+1}{k}$ does not depend on the network size $n$, which is problematic: we
can only recover a constant number of cliques no matter how large the network is. The
main reason for such a negative result is that the RIP tries to guarantee exact recovery for
arbitrary signals with a sparse representation in $A$. For many applications, such a condition
is too strong to be realistic. Instead of studying such "universal" conditions, in this paper we
seek conditions that secure exact recovery of a collection of sparse signals $x_0$, whose sparsity
pattern satisfies certain conditions more appropriate to our setting. Such conditions could
be more natural in reality, and will be shown in the sequel to simply require bounded
overlaps between cliques.

Remark 5.2. Recall that the matrix $A$ has altogether $N = \binom{n}{k}$ columns. Each column in
fact corresponds to a $k$-clique. Therefore, we could also use a $k$-clique to index a column
of $A$. In this sense, let $T = \{i_1, \ldots, i_s\} \subset \{1, \ldots, N\}$ be a subset of size $s$. An equivalent
notation is to represent $T$ as a class of sets: $T = \{\tau_1, \ldots, \tau_s\}$ where each $\tau_i \subset \{1, \ldots, n\}$ and
$|\tau_i| = k$.



Proof. We can extract a set of columns $T = \{\tau : \tau \subset \{1, 2, \ldots, k + j + 1\} \text{ and } |\tau| = k\}$
($\tau$ is interpreted as a $k$-set) and form a submatrix $A_T$. Recall that $A$ has altogether $\binom{n}{j}$
rows. Combining this with the condition that $n > k + j + 1$ and the fact that the
number of nonzero rows of $A_T$ is exactly $\binom{k+j+1}{j}$, we know that there must exist
rows in $A_T$ which only contain zeros.

By discarding zero rows, it is easy to show that the rank of $A_T$ is at most $\binom{k+j+1}{j}$, which is
less than the number of columns. To see that $\binom{k+j+1}{j}$ is less than the number of columns,
we need to exploit the fact that $j < k$, therefore

$$ \binom{k+j+1}{j} < \binom{k+j+1}{k}, $$

from which we see that the number of nonzero rows of $A_T$ is smaller than the number of
columns.

Thus, the columns in $A_T$ must be linearly dependent. In other words, there exists a nonzero
vector $h \in \mathbb{R}^N$ with $\mathrm{supp}(h) \subseteq T$ such that $Ah = 0$. When $s \ge \binom{k+j+1}{k}$, since $|\mathrm{supp}(h)| \le
|T| \le s$, we cannot expect universal sparse recovery for all $s$-sparse signals. □
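
The counting argument in the proof can be checked numerically on a small instance. The sketch below is an illustration under assumed parameters ($n = 8$, $j = 2$, $k = 3$, zero-based node labels); the function name `restricted_submatrix` is hypothetical. It builds the columns of the unnormalized incidence matrix indexed by the $k$-sets inside the first $k + j + 1$ nodes and compares the number of nonzero rows, the rank, and the number of columns.

```python
from itertools import combinations
from math import comb
import numpy as np

def restricted_submatrix(n, j, k):
    """Columns of the j-set/k-set incidence matrix whose k-sets lie in the first k+j+1 nodes."""
    m = k + j + 1
    if n <= m:
        raise ValueError("Theorem 5.1 assumes n > k + j + 1")
    j_sets = list(combinations(range(n), j))
    row_of = {s: p for p, s in enumerate(j_sets)}
    T = list(combinations(range(m), k))            # k-sets inside {0, ..., k + j}
    A_T = np.zeros((len(j_sets), len(T)))
    for q, tau in enumerate(T):
        for sigma in combinations(tau, j):
            A_T[row_of[sigma], q] = 1.0
    return A_T

n, j, k = 8, 2, 3
A_T = restricted_submatrix(n, j, k)
nonzero_rows = int(np.count_nonzero(A_T.any(axis=1)))
rank = int(np.linalg.matrix_rank(A_T))
print(A_T.shape, nonzero_rows, rank)
```

Since the rank cannot exceed the number of nonzero rows, and that number is smaller than the number of columns, the columns are linearly dependent. Column normalization does not change this conclusion because all columns are scaled by the same constant.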



B. Exact Recovery Conditions 



Here we present our exact recovery conditions for $x_0$ from the observed data $b$ by solving
the linear program $\mathcal{P}_1$. Suppose $A$ is an $M$-by-$N$ matrix and $x_0$ is a sparse signal. Let $T =
\mathrm{supp}(x_0)$, $T^c$ be the complement of $T$, and $A_T$ (or $A_{T^c}$) be the submatrix of $A$ where we only
extract the column set $T$ (or $T^c$, respectively). The following proposition from Candès and Tao
(2005) characterizes the conditions under which $\mathcal{P}_1$ has a unique solution. To make this paper
self-contained, we also include the proof in this section.



Proposition 5.3. (Candès and Tao, 2005) Let $x_0 = (x_{01}, \ldots, x_{0N})^T$. We assume that $A_T^* A_T$
is invertible and there exists a vector $w \in \mathbb{R}^M$ such that

1. $\langle A_j, w \rangle = \mathrm{sign}(x_{0j})$, $\forall j \in T$;

2. $|\langle A_j, w \rangle| < 1$, $\forall j \in T^c$.

Then $x_0$ is the unique solution of $\mathcal{P}_1$.

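Proof. The necessity of the two conditions comes from the KKT conditions of $\mathcal{P}_1$. Consider
an equivalent form of $\mathcal{P}_1$:

$$ \min \; \mathbf{1}^T \xi \quad \text{subject to} \quad Ax - b = 0, \quad -\xi \le x \le \xi, \quad \xi \ge 0, $$

whose Lagrangian is

$$ L(x, \xi; \gamma, \lambda, \mu) = \mathbf{1}^T \xi + \gamma^T (Ax - b) - \lambda_+^T (\xi - x) - \lambda_-^T (\xi + x) - \mu^T \xi. $$

Here $\gamma \in \mathbb{R}^M$, $\lambda_+ = (\lambda_+(1), \ldots, \lambda_+(N))^T \in \mathbb{R}^N_+$, $\lambda_- = (\lambda_-(1), \ldots, \lambda_-(N))^T \in \mathbb{R}^N_+$, and $\mu \in \mathbb{R}^N_+$
are the Lagrange multipliers.

Then the KKT conditions give

1. $A^* \gamma + (\lambda_+ - \lambda_-) = 0$,

2. $\mathbf{1} - (\lambda_+ + \lambda_-) - \mu = 0$,

with $\lambda, \mu \ge 0$ and $\lambda_+(j) \lambda_-(j) = 0$ for all $j$.

Clearly $T = \mathrm{supp}(x_0) = \{j : \xi_j > 0\}$. Let $w = -\gamma$. By the Strictly Complementary Theorem
for linear programming in Ye (1997), there exist $\mu$ and $\xi$ such that $1 \ge \mu_j > 0$ for all $j \in T^c$
with $\xi_j = 0$, and $\mu_j = 0$ for all $j \in T$ with $\xi_j > 0$. Thus, the first equation leads to

$$ \langle w, A_j \rangle = \lambda_+(j) - \lambda_-(j) = \mathrm{sign}(x_{0j}), \quad j \in T; $$

the second equation leads to

$$ |\langle w, A_j \rangle| = |\lambda_+(j) - \lambda_-(j)| = 1 - \mu_j < 1, \quad j \in T^c. $$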
Therefore, the two conditions are necessary for $x_0$ to be the unique solution of $\mathcal{P}_1$.

To prove that these two conditions are sufficient to guarantee that $x_0$ is the unique minimizer
of $\mathcal{P}_1$, we need to show that any minimizer $y_0$ of the problem $\mathcal{P}_1$ must be equal to $x_0$. Since $x_0$
obeys the constraint $Ax_0 = b$, we must have

$$ \|y_0\|_1 \le \|x_0\|_1. $$

Now take a $w$ obeying the two conditions. We then compute

$$ \|y_0\|_1 = \sum_{j \in T} |x_{0j} + (y_{0j} - x_{0j})| + \sum_{j \notin T} |y_{0j}| $$
$$ \ge \sum_{j \in T} \mathrm{sign}(x_{0j}) \left( x_{0j} + (y_{0j} - x_{0j}) \right) + \sum_{j \notin T} y_{0j} \langle w, A_j \rangle $$
$$ = \sum_{j \in T} |x_{0j}| + \sum_{j \in T} (y_{0j} - x_{0j}) \langle w, A_j \rangle + \sum_{j \notin T} y_{0j} \langle w, A_j \rangle $$
$$ = \|x_0\|_1 + \Big\langle w, \sum_{j \in T \cup T^c} (y_{0j} - x_{0j}) A_j \Big\rangle $$
$$ = \|x_0\|_1 + \langle w, b - b \rangle $$
$$ = \|x_0\|_1. $$

Thus, the inequalities in the above computation must in fact be equalities. Since $|\langle w, A_j \rangle|$ is
strictly less than 1 for all $j \notin T$, this in particular forces $y_{0j} = 0$ for all $j \notin T$. Thus

$$ \sum_{j \in T} (y_{0j} - x_{0j}) A_j = b - b = 0. $$

Since all columns in $A_T$ are independent, we must have $y_{0j} = x_{0j}$ for all $j \in T$. Thus $x_0 = y_0$.
This concludes the proof of the proposition. □



The above proposition points out the necessary and sufficient condition under which, in the
noise-free setting, $\mathcal{P}_1$ exactly recovers the sparse signal $x_0$. The necessity and sufficiency come from
the KKT conditions in convex optimization theory (Candès and Tao, 2005). However, this
condition is difficult to check due to the presence of $w$. If we further assume that $w$ lies in
the column span of $A_T$, the condition in Proposition 5.3 reduces to the following condition.

Irrepresentable Condition (IRR) The matrix $A$ satisfies the IRR condition with respect
to $T = \mathrm{supp}(x_0)$, if $A_T^* A_T$ is invertible and

$$ \|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty < 1, $$

or, equivalently,

$$ \|(A_T^* A_T)^{-1} A_T^* A_{T^c}\|_1 < 1, $$

where $\|\cdot\|_\infty$ stands for the matrix sup-norm, i.e., $\|A\|_\infty := \max_i \sum_j |A_{ij}|$, and $\|A\|_1 =
\max_j \sum_i |A_{ij}|$.



Proposition 5.4. By restricting $w$ to lie in the image of $A_T$, the conditions in
Proposition 5.3 reduce to the IRR condition.

Proof. Since $w$ lies in the image of $A_T$, we can write $w = A_T v$. To make sure that the
first condition in Proposition 5.3 holds, we must have $v = (A_T^* A_T)^{-1} \mathrm{sign}(x_0)$, where
$\mathrm{sign}(x_0)$ is restricted to the entries in $T$, so

$$ w = A_T (A_T^* A_T)^{-1} \mathrm{sign}(x_0). $$

Now the second condition in Proposition 5.3 can be equivalently written as

$$ \|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty < 1, $$

which is exactly the IRR condition. □
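
The certificate $w = A_T (A_T^* A_T)^{-1} \mathrm{sign}(x_0)$ from the proof can be constructed and checked directly on a small instance. The sketch below assumes the unnormalized pair-versus-triangle dictionary on 6 nodes, a support consisting of two disjoint triangles, and an arbitrary sign pattern; the helper name `pair_clique_matrix` is hypothetical.

```python
from itertools import combinations
import numpy as np

def pair_clique_matrix(n, k):
    """Pair-versus-clique incidence matrix (rows: node pairs, columns: k-cliques)."""
    pairs = list(combinations(range(n), 2))
    cliques = list(combinations(range(n), k))
    row_of = {p: i for i, p in enumerate(pairs)}
    A = np.zeros((len(pairs), len(cliques)))
    for q, c in enumerate(cliques):
        for p in combinations(c, 2):
            A[row_of[p], q] = 1.0
    return A, cliques

A, cliques = pair_clique_matrix(n=6, k=3)
T = [cliques.index((0, 1, 2)), cliques.index((3, 4, 5))]   # assumed support of x0
Tc = [q for q in range(len(cliques)) if q not in T]
signs = np.array([1.0, -1.0])                               # assumed sign pattern on T

A_T = A[:, T]
v = np.linalg.solve(A_T.T @ A_T, signs)   # v = (A_T^* A_T)^{-1} sign(x0)
w = A_T @ v                               # certificate in the column span of A_T

on_support = A_T.T @ w                    # condition 1: equals the sign pattern
off_support = np.abs(A[:, Tc].T @ w)      # condition 2: strictly below 1
print(on_support, off_support.max())
```

Here the two support columns share no pairs, so $A_T^* A_T$ is a multiple of the identity and every off-support inner product has magnitude $1/3$.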

Intuitively, the IRR condition requires that, for the true sparse signal $x_0$, the relevant
bases $A_T$ are not highly correlated with the irrelevant bases $A_{T^c}$. Note that this condition only
depends on $A$ and $x_0$, which is easier to check. The assumption that $w$ lies in the column
span of $A_T$ is mild; it is actually a necessary condition for $x_0$ to be reconstructed by the
Lasso (Tibshirani, 1996) or the Dantzig selector (Candès and Tao), even under Gaussian-like
noise assumptions (Zhao and Yu, 2006; Yuan and Lin).

C. Detecting Cliques of Equal Size

In this subsection, we present sufficient conditions for IRR which can be easily verified. We consider the case $A = R^{j,k}$ with $j < k$. Given data $b$ about all $j$-sets, we want to infer important $k$-cliques. Suppose $x_0$ is a sparse signal on all $k$-cliques. We have the following theorem, which is a direct result of Lemma 5.6.

Theorem 5.5. Let $T = \mathrm{supp}(x_0)$. If we enforce the overlaps among $k$-cliques in $T$ to be no larger than $r$, then $r \le j - 2$ guarantees the IRR condition.

Lemma 5.6. Let $T = \mathrm{supp}(x_0)$ and $j \ge 2$. Suppose that for any $\sigma_1, \sigma_2 \in T$, the two cliques corresponding to $\sigma_1$ and $\sigma_2$ have overlaps no larger than $r$. Then:

1. If $r \le j - 2$, then $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty < 1$;

2. If $r = j - 1$, then $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty \le 1$, where equality holds for certain examples;

3. If $r = j$, there are examples such that $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty > 1$.

Note that Theorem 5.5 is only an easy-to-verify condition based on a worst-case analysis, which is sufficient but not necessary. What really matters is the IRR condition. The theorem uses a simple characterization of allowed clique overlaps which guarantees the IRR condition. Specifically, clique overlaps no larger than $j - 2$ are sufficient to guarantee exact sparse recovery by $\mathcal{P}_1$, while larger overlaps may violate the IRR condition. Since this theorem is based on a worst-case analysis, in real applications one may encounter examples which have overlaps larger than $j - 2$ while $\mathcal{P}_1$ still works.

In summary, IRR is sufficient and almost necessary to guarantee exact recovery. Theorem 5.5 tells us that the intuition behind IRR is that overlaps among cliques must be small enough, which is easier to check. In the next subsection, we show that IRR is also sufficient to guarantee stable recovery with noise.
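The overlap test of Theorem 5.5 is simple enough to sketch in a few lines. The Python below is only an illustration: the function names and the example support are hypothetical, and cliques are represented as tuples of node indices.

```python
from itertools import combinations

def max_pairwise_overlap(cliques):
    """Largest intersection size between two distinct cliques (0 if fewer than two)."""
    return max(
        (len(set(a) & set(b)) for a, b in combinations(cliques, 2)),
        default=0,
    )

def satisfies_overlap_condition(cliques, j):
    """Sufficient (not necessary) test of Theorem 5.5: all overlaps are <= j - 2."""
    return max_pairwise_overlap(cliques) <= j - 2

# Hypothetical support: three 4-cliques that pairwise share a single node.
support = [(0, 1, 2, 3), (3, 4, 5, 6), (6, 7, 8, 0)]
overlap = max_pairwise_overlap(support)
ok_for_triples = satisfies_overlap_condition(support, j=3)
ok_for_pairs = satisfies_overlap_condition(support, j=2)
```

A `False` result only means that this worst-case test is inconclusive; the IRR condition itself may still hold.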

Proof. To prove Lemma 5.6, given any $\tau \in T^c$, we define

$$\mu_\tau = \sum_{\sigma \in T} \binom{|\sigma \cap \tau|}{j} \Big/ \binom{k}{j}.$$

The intuition behind this definition is that

$$\sup_{\tau \in T^c} \mu_\tau = \|A_{T^c}^* A_T\|_\infty. \qquad (5.2)$$

As we will see in the following proofs, we essentially try to bound $\mu_\tau$ for $\tau \in T^c$.

Before we present the detailed technical proof, we first introduce the high-level idea. Our main purpose is to bound $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty$. Each entry of the matrix $A_T^* A_T$ is indexed by two $k$-sets, and the value of this entry represents how many $j$-sets are contained in the intersection of these two $k$-sets (normalized by $\binom{k}{j}$). Under the condition that $r \le j - 1$, it is straightforward that the matrix $A_T^* A_T$ is the identity. Therefore, bounding $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty$ is equivalent to bounding $\|A_{T^c}^* A_T\|_\infty$, which is exactly $\sup_{\tau \in T^c} \mu_\tau$.
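The counting interpretation of the Gram entries can be checked numerically. The sketch below assumes that each column of $R^{j,k}$ is the indicator of the $j$-sets contained in a $k$-set, scaled by $1/\sqrt{\binom{k}{j}}$ so that columns have unit norm; this normalization is inferred from the proof, and the helper names are hypothetical.

```python
from itertools import combinations
from math import comb, isclose, sqrt

def radon_column(k_set, j_sets, j):
    """Column of the (assumed) normalized Radon basis indexed by one k-set."""
    scale = 1.0 / sqrt(comb(len(k_set), j))
    members = set(k_set)
    return [scale if set(p) <= members else 0.0 for p in j_sets]

def gram_entry(sigma, tau, j):
    """Closed form: number of j-sets inside the intersection, divided by C(k, j)."""
    return comb(len(set(sigma) & set(tau)), j) / comb(len(sigma), j)

n, j = 6, 2
j_sets = list(combinations(range(n), j))
sigma, tau = (0, 1, 2), (1, 2, 3)        # two 3-sets sharing the pair {1, 2}
col_sigma = radon_column(sigma, j_sets, j)
col_tau = radon_column(tau, j_sets, j)
dot = sum(a * b for a, b in zip(col_sigma, col_tau))
```

Here the explicit inner product `dot` and the set-based formula `gram_entry(sigma, tau, j)` agree, so the matrix never has to be formed to evaluate such entries.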

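Proof of the case under Condition 1

Under Condition 1, any $\sigma_1, \sigma_2 \in T$ satisfy $|\sigma_1 \cap \sigma_2| \le j - 2$, hence any two columns in $T$ are orthogonal. This implies that $A_T^* A_T$ is an identity matrix.

Now given $\tau \in T^c$, we will prove $\mu_\tau < 1$ under Condition 1. If this is true, then

$$\sup_{\tau \in T^c} \mu_\tau = \|A_{T^c}^* A_T\|_\infty = \|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty < 1.$$

Let $T = \{\sigma_1, \sigma_2, \ldots, \sigma_{|T|}\}$, where the $\sigma_i$ ($1 \le i \le |T|$) are $k$-sets. We need to prove

$$\mu_\tau = \sum_{i=1}^{|T|} \binom{|\tau \cap \sigma_i|}{j} \Big/ \binom{k}{j} < 1$$

for all $\tau \in T^c$.

Let $\mathcal{M}_i = \{\rho : |\rho| = j, \rho \subset \tau \cap \sigma_i\}$, so $\mathcal{M}_i$ is the collection of $j$-sets of $\tau \cap \sigma_i$ (if $|\tau \cap \sigma_i| < j$, then $\mathcal{M}_i$ is simply the empty set). Obviously, we have $|\mathcal{M}_i| = \binom{|\tau \cap \sigma_i|}{j}$. So

$$\sum_{i=1}^{|T|} \binom{|\tau \cap \sigma_i|}{j} = \sum_{i=1}^{|T|} |\mathcal{M}_i|.$$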



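Now we note that for any $1 \le i < l \le |T|$, we have $\mathcal{M}_i \cap \mathcal{M}_l = \emptyset$. Otherwise, suppose $\rho \in \mathcal{M}_1 \cap \mathcal{M}_2$; then $\rho$ is a $j$-set of both $\tau \cap \sigma_1$ and $\tau \cap \sigma_2$. Hence $\rho \subset \tau \cap \sigma_1$ and $\rho \subset \tau \cap \sigma_2$, which implies that

$$|\sigma_1 \cap \sigma_2| \ge |(\tau \cap \sigma_1) \cap (\tau \cap \sigma_2)| \ge |\rho| \ge j.$$

This contradicts the condition that the $\sigma_i$'s ($1 \le i \le |T|$) have overlaps at most $j - 2$. So the $\mathcal{M}_i$ must be pairwise disjoint. Hence

$$\sum_{i=1}^{|T|} \binom{|\tau \cap \sigma_i|}{j} = \sum_{i=1}^{|T|} |\mathcal{M}_i| = \Big| \bigcup_{i=1}^{|T|} \mathcal{M}_i \Big|.$$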



For any $1 \le i \le |T|$, every $\rho \in \mathcal{M}_i$ is a $j$-set of $\tau \cap \sigma_i$, and hence a $j$-set of $\tau$. The set $\tau$ is of size $k$. So if we let $\mathcal{M}_0 = \{\rho : |\rho| = j, \rho \subset \tau\}$, which is the collection of all $j$-sets of $\tau$, then we have $\bigcup_{i=1}^{|T|} \mathcal{M}_i \subset \mathcal{M}_0$. So $|\bigcup_{i=1}^{|T|} \mathcal{M}_i| \le |\mathcal{M}_0| = \binom{k}{j}$.

So far we have proved $\mu_\tau \le 1$. The above proof of $\mu_\tau \le 1$ for any $\tau \in T^c$ remains valid under Condition 2. Next, we prove that if any $\sigma_i, \sigma_l \in T$ satisfy $|\sigma_i \cap \sigma_l| \le j - 2$, then equality cannot hold.

Without loss of generality, we assume $|\sigma_1 \cap \tau| \ge j$; otherwise, if none of the $\sigma_i$'s satisfies $|\sigma_i \cap \tau| \ge j$, then $\mu_\tau = 0$, which finishes the proof. To show that equality does not hold, we only need to find one $j$-set in $\mathcal{M}_0$ that does not belong to $\bigcup_i \mathcal{M}_i$.

In this case, we can let $\tau = \{1, 2, \ldots, k\}$ and $\sigma_1 = \{1, 2, \ldots, s, k+1, k+2, \ldots, 2k-s\}$, where $j \le s \le k - 1$ ($s \le k - 1$ because otherwise $\sigma_1 = \tau$, which contradicts the fact that $\sigma_1 \in T$ and $\tau \in T^c$). Now we show that $\rho_0 = \{1, 2, \ldots, j-1, s+1\}$ is not a member of $\bigcup_{i=1}^{|T|} \mathcal{M}_i$. Clearly $\rho_0$ is not a member of $\mathcal{M}_1$ because $s + 1 \notin \sigma_1$. It remains to show that $\rho_0$ is not a member of any $\mathcal{M}_i$ ($2 \le i \le |T|$). If this were not true, say $\rho_0 \in \mathcal{M}_2$, then $\rho_0 \subset (\tau \cap \sigma_2) \subset \sigma_2$, so $\{1, 2, \ldots, j-1\} \subset \sigma_1 \cap \sigma_2$, which contradicts the condition that $|\sigma_1 \cap \sigma_2| \le j - 2$.

On the other hand, it is clear that $\rho_0 \in \mathcal{M}_0$, so $\bigcup_{i=1}^{|T|} \mathcal{M}_i$ is a proper subset of $\mathcal{M}_0$. So $|\bigcup_{i=1}^{|T|} \mathcal{M}_i| < \binom{k}{j}$, which means $\mu_\tau < 1$.
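The bound $\mu_\tau < 1$ under Condition 1 can be checked by brute force on a small instance. The following Python sketch uses exact rational arithmetic; the instance ($n = 8$, $k = 4$, $j = 3$, two cliques sharing one node) and the function names are hypothetical choices made only for illustration.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def mu(tau, support, j):
    """mu_tau = sum over sigma in T of C(|tau & sigma|, j) / C(k, j)."""
    k = len(tau)
    return sum(
        Fraction(comb(len(set(tau) & set(sigma)), j), comb(k, j))
        for sigma in support
    )

def sup_mu(n, k, j, support):
    """Maximum of mu_tau over all k-sets tau outside the support T."""
    in_support = {frozenset(s) for s in support}
    return max(
        mu(tau, support, j)
        for tau in combinations(range(n), k)
        if frozenset(tau) not in in_support
    )

n, k, j = 8, 4, 3
support = [(0, 1, 2, 3), (3, 4, 5, 6)]   # overlap 1 = j - 2, so Condition 1 holds
worst = sup_mu(n, k, j, support)
```

Because $A_T^* A_T$ is the identity under Condition 1, `worst` equals the quantity $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty$ for this instance. The enumeration is exponential in $k$ and is meant only as a sanity check.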

Proof of the case under Condition 2

Under Condition 2, the argument is almost the same as the proof under Condition 1. We have that $A_T^* A_T$ is an identity matrix and $\mu_\tau \le 1$. However, one cannot show $\mu_\tau < 1$ in this case. We have the following example where, if $n$ is large enough, $\mu_\tau$ is exactly equal to one.

Let $\tau = \{1, 2, \ldots, k\} \in T^c$. Denote all the $j$-sets of $\tau$ by $\rho_1, \rho_2, \ldots, \rho_{\binom{k}{j}}$. When $n$ is large enough, we choose $\binom{k}{j}$ disjoint $(k-j)$-sets of $\{k+1, k+2, \ldots, n\}$, denoted by $\omega_1, \omega_2, \ldots, \omega_{\binom{k}{j}}$.

Let $T = \{\sigma_1, \sigma_2, \ldots, \sigma_{|T|}\}$, where $\sigma_i = \rho_i \cup \omega_i$. Hence $|T| = \binom{k}{j}$ and the $\sigma_i$'s satisfy $|\sigma_i \cap \sigma_l| \le j - 1$. But

$$\mu_\tau = \sum_{i=1}^{|T|} \binom{|\tau \cap \sigma_i|}{j} \Big/ \binom{k}{j} = \sum_{i=1}^{|T|} \binom{j}{j} \Big/ \binom{k}{j} = 1.$$
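The extremal construction above can be instantiated for small parameters. The sketch below takes $k = 3$, $j = 2$ and uses 0-based node labels (the text uses 1-based labels); all variable names are illustrative.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

k, j = 3, 2
tau = tuple(range(k))                       # tau = {0, 1, 2}
rhos = list(combinations(tau, j))           # all j-sets of tau
# Disjoint (k - j)-sets built from fresh nodes k, k + 1, ...
omegas = [
    tuple(range(k + i * (k - j), k + (i + 1) * (k - j)))
    for i in range(len(rhos))
]
support = [rho + omega for rho, omega in zip(rhos, omegas)]

max_overlap = max(len(set(a) & set(b)) for a, b in combinations(support, 2))
mu_tau = sum(
    Fraction(comb(len(set(tau) & set(sigma)), j), comb(k, j))
    for sigma in support
)
```

The support has pairwise overlaps equal to $j - 1$, and `mu_tau` evaluates to exactly one, matching the equality case of Condition 2.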

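Proof of the case under Condition 3

Under Condition 3, we can construct examples where

$$\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty > 1.$$

Let $\rho_1, \rho_2, \ldots, \rho_{\binom{k}{j}}$ be all $j$-sets of $\{1, 2, \ldots, k\}$. For large enough $n$, it is possible to choose $\binom{k}{j} + 1$ disjoint $(k-j)$-sets of $\{k+1, k+2, \ldots, n\}$, say $\omega_0, \omega_1, \omega_2, \ldots, \omega_{\binom{k}{j}}$. Let $\sigma_i = \rho_i \cup \omega_i$ for $1 \le i \le \binom{k}{j}$ and $\sigma_0 = \rho_1 \cup \omega_0$. Define $T = \{\sigma_0, \sigma_1, \sigma_2, \ldots, \sigma_{\binom{k}{j}}\}$, which is of size $|T| = \binom{k}{j} + 1$.

In this case, $|\sigma_i \cap \sigma_l| \le j - 1$ for any $1 \le i < l \le \binom{k}{j}$, $|\sigma_0 \cap \sigma_1| = j$, and $|\sigma_0 \cap \sigma_i| \le j - 1$ for any $2 \le i \le \binom{k}{j}$. Then $A_T^* A_T$ is a $(\binom{k}{j} + 1)$ by $(\binom{k}{j} + 1)$ matrix, shown below, with rows and columns corresponding to $\{\sigma_0, \sigma_1, \ldots, \sigma_{\binom{k}{j}}\}$: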



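$$A_T^* A_T = \begin{pmatrix} 1 & \epsilon & & & \\ \epsilon & 1 & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{pmatrix}$$

Here $\epsilon = 1/\binom{k}{j}$ and blank entries are zero. The inverse of the matrix is

$$(A_T^* A_T)^{-1} = \begin{pmatrix} \frac{1}{1-\epsilon^2} & \frac{-\epsilon}{1-\epsilon^2} & & & \\ \frac{-\epsilon}{1-\epsilon^2} & \frac{1}{1-\epsilon^2} & & & \\ & & 1 & & \\ & & & \ddots & \\ & & & & 1 \end{pmatrix}$$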



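Consider $\tau = \{1, 2, \ldots, k\} \in T^c$. The row of $A_{T^c}^* A_T$ corresponding to $\tau$ is a vector of length $|T| = \binom{k}{j} + 1$ with each entry equal to $\epsilon = 1/\binom{k}{j}$. So the row of $A_{T^c}^* A_T (A_T^* A_T)^{-1}$ corresponding to $\tau$ is a vector of length $\binom{k}{j} + 1$, namely $[\frac{\epsilon}{1+\epsilon}, \frac{\epsilon}{1+\epsilon}, \epsilon, \ldots, \epsilon]$. This vector has row sum

$$\frac{2\epsilon}{1+\epsilon} + \Big( \binom{k}{j} - 1 \Big) \epsilon = \frac{2\epsilon}{1+\epsilon} + 1 - \epsilon = \frac{1 + 2\epsilon - \epsilon^2}{1+\epsilon} > 1.$$

Hence in this example

$$\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty \ge \frac{1 + 2\epsilon - \epsilon^2}{1+\epsilon} > 1.$$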
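The row sum in the Condition 3 example can be reproduced exactly for small parameters. The sketch below builds the construction for $k = 3$, $j = 2$ with 0-based node labels, forms the Gram matrices from set intersections, and inverts $A_T^* A_T$ with a small Gauss-Jordan routine over rationals. The helper names are hypothetical, and the unit-norm column normalization is assumed as in the proof.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def gram(rows, cols, j, k):
    """Inner products of normalized Radon columns, computed from intersections."""
    return [[Fraction(comb(len(set(a) & set(b)), j), comb(k, j)) for b in cols]
            for a in rows]

def inverse(mat):
    """Gauss-Jordan inverse of a small nonsingular matrix of Fractions."""
    n = len(mat)
    aug = [list(row) + [Fraction(int(i == r)) for i in range(n)]
           for r, row in enumerate(mat)]
    for c in range(n):
        p = next(r for r in range(c, n) if aug[r][c] != 0)
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(n):
            if r != c and aug[r][c] != 0:
                factor = aug[r][c]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

k, j = 3, 2
tau = (0, 1, 2)
rhos = list(combinations(tau, j))
# sigma_0 duplicates rho_1 with a fresh node; sigma_i = rho_i plus a fresh node.
support = [rhos[0] + (3,)] + [rho + (4 + i,) for i, rho in enumerate(rhos)]

g_inv = inverse(gram(support, support, j, k))
row = gram([tau], support, j, k)[0]
irr_row = [sum(row[m] * g_inv[m][c] for m in range(len(support)))
           for c in range(len(support))]
row_sum = sum(abs(v) for v in irr_row)

eps = Fraction(1, comb(k, j))
closed_form = (1 + 2 * eps - eps ** 2) / (1 + eps)
```

For this instance `row_sum` and `closed_form` both equal 7/6, which exceeds one as the proof claims.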



In the following, we construct explicit conditions which allow large overlaps while IRR still holds, as long as such heavy overlaps do not occur too often among the cliques in $T$. The existence of a partition of $T$ in the next theorem is a reasonable assumption in network settings where network hierarchies exist. In social networks, it has been observed by Girvan and Newman (2002) that communities themselves also join together to form meta-communities. The assumptions in the next theorem, which allow relatively larger overlaps between communities from the same meta-community and relatively smaller overlaps between communities from different meta-communities, characterize such a scenario.



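Theorem 5.7. Assume $(k+1)/2 \le j < k$. Let $T = \mathrm{supp}(x_0)$. Suppose there exists a partition $T = T_1 \cup T_2 \cup \cdots \cup T_m$, with each $T_i$ satisfying $|T_i| \le K$, such that

• for any $\sigma_i, \sigma_l$ belonging to the same partition, $|\sigma_i \cap \sigma_l| \le r$;

• for any $\sigma_i, \sigma_l$ belonging to different partitions, $|\sigma_i \cap \sigma_l| \le 2j - k - 1$.

If $K$ satisfies

$$(K-1) \binom{r}{j} \Big/ \binom{k}{j} \le \frac{1}{4}, \qquad \Big( \binom{k-1}{j} + (K-1) \binom{\lfloor (k+r)/2 \rfloor}{j} \Big) \Big/ \binom{k}{j} < \frac{3}{4},$$

then IRR holds.

Proof. We will show that the following two inequalities hold:

$$\|A_T^* A_T - I\|_\infty \le (K-1) \binom{r}{j} \Big/ \binom{k}{j}, \qquad \|A_{T^c}^* A_T\|_\infty \le \Big( \binom{k-1}{j} + (K-1) \binom{\lfloor (k+r)/2 \rfloor}{j} \Big) \Big/ \binom{k}{j}.$$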

We first bound the sup-norm of $A_T^* A_T - I$. Note that when $\sigma_i$ and $\sigma_l$ belong to different partitions of $T$, the corresponding entry $\binom{|\sigma_i \cap \sigma_l|}{j} / \binom{k}{j}$ of $A_T^* A_T$ is zero, because their overlap is no larger than $2j - k - 1$, which is strictly smaller than $j$. So $A_T^* A_T$ is a block diagonal matrix with block sizes $|T_1|, |T_2|, \ldots, |T_m|$, and each diagonal entry of $A_T^* A_T$ is one.

Thus, for any $\sigma \in T$, only cliques from the same partition as $\sigma$ may have overlaps with $\sigma$ of size at least $j$. Thus, the row sum of $A_T^* A_T - I$ can be bounded by $(K-1) \binom{r}{j} / \binom{k}{j}$. So the first inequality is now established.

To prove the second inequality, we observe that for a fixed $\tau \in T^c$, $|\tau \cap \sigma_i| \ge j$ and $|\tau \cap \sigma_l| \ge j$ cannot hold at the same time for any $\sigma_i$ and $\sigma_l$ belonging to different partitions. Otherwise, we would have

$$|\tau| \ge |\tau \cap (\sigma_i \cup \sigma_l)| = |\tau \cap \sigma_i| + |\tau \cap \sigma_l| - |\tau \cap \sigma_i \cap \sigma_l| \ge j + j - (2j - k - 1) = k + 1,$$

which contradicts $|\tau| = k$. Thus, all $\sigma$'s whose intersections with a fixed $\tau$ have size no less than $j$ must lie in the same partition of $T$.

For the same reason, we can show that for a fixed $\tau \in T^c$, $|\tau \cap \sigma_i| \ge (k + r + 1)/2$ and $|\tau \cap \sigma_l| \ge (k + r + 1)/2$ cannot hold at the same time for $\sigma_i$ and $\sigma_l$ belonging to the same partition of $T$. Otherwise, we would have

$$|\tau| \ge |\tau \cap (\sigma_i \cup \sigma_l)| = |\tau \cap \sigma_i| + |\tau \cap \sigma_l| - |\tau \cap \sigma_i \cap \sigma_l| \ge (k + r + 1)/2 + (k + r + 1)/2 - r = k + 1.$$



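Thus the maximum row sum of $A_{T^c}^* A_T$ is bounded from above by

$$\Big( \binom{k-1}{j} + (K-1) \binom{\lfloor (k+r)/2 \rfloor}{j} \Big) \Big/ \binom{k}{j}.$$

Under the conditions on $K$, we then have

$$\|A_T^* A_T - I\|_\infty \le 1/4, \qquad \|A_{T^c}^* A_T\|_\infty < 3/4.$$

Thus,

$$\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty \le \|A_{T^c}^* A_T\|_\infty \|(A_T^* A_T)^{-1}\|_\infty \le \|A_{T^c}^* A_T\|_\infty \Big( 1 + \sum_{l=1}^{\infty} \|(A_T^* A_T - I)^l\|_\infty \Big) < \frac{3}{4} \Big( 1 + \sum_{l=1}^{\infty} (1/4)^l \Big) = 1.$$

So IRR holds under our conditions. □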
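The last step of the proof relies on a Neumann series expansion of $(A_T^* A_T)^{-1}$, which converges because $\|I - A_T^* A_T\|_\infty \le 1/4 < 1$. Written out, the arithmetic behind the final equality is:

```latex
\Big\| (A_T^* A_T)^{-1} \Big\|_\infty
  = \Big\| \sum_{l=0}^{\infty} (I - A_T^* A_T)^l \Big\|_\infty
  \le \sum_{l=0}^{\infty} \Big( \tfrac{1}{4} \Big)^l
  = \frac{1}{1 - 1/4} = \frac{4}{3},
\qquad
\frac{3}{4} \cdot \frac{4}{3} = 1 .
```

The strict inequality in the theorem comes from the strict bound $\|A_{T^c}^* A_T\|_\infty < 3/4$.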

The basis matrix $A = R^{j,k}$ has $\binom{n}{k}$ bases, which is not polynomial with respect to $k$. As we will see in later sections, a practical implementation of Radon basis pursuit for the clique detection problem works on a subset of all $\binom{n}{k}$ bases. In that case, we are actually solving $\mathcal{P}_1$ and $\mathcal{P}_{1,\delta}$ with the basis matrix $\tilde{A}$, which is only a submatrix of $A$ with a subset of column bases extracted. We have the following theorem regarding this scenario.



Theorem 5.8. Denote the set of all cliques for columns in $\tilde{A}$ by $S$, where $\tilde{A}$ is a submatrix of $A$. Assume any two $k$-cliques in $\tilde{A}$ have intersections at most $r$, i.e., $\forall \sigma_i, \sigma_l \in T \cup T^c$, $|\sigma_i \cap \sigma_l| \le r$, where $T = \mathrm{supp}(x_0) \subset S$, and $T^c$ is the complement of $T$ with respect to $S$. Then IRR holds if

$$r \le \Big( \frac{1}{|T| (1 + \sqrt{|T|})} \Big)^{1/j} k. \qquad (5.3)$$



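Proof. Note that

$$\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty \le \|A_{T^c}^* A_T\|_\infty \|(A_T^* A_T)^{-1}\|_\infty \le \|A_{T^c}^* A_T\|_\infty \cdot \sqrt{|T|} \, \|(A_T^* A_T)^{-1}\|_2.$$

So it suffices to show

$$\|A_{T^c}^* A_T\|_\infty \cdot \sqrt{|T|} \, \|(A_T^* A_T)^{-1}\|_2 < 1$$

under condition (5.3). Firstly,

$$\|A_{T^c}^* A_T\|_\infty = \max_{\tau \in T^c} \sum_{\sigma \in T} \binom{|\tau \cap \sigma|}{j} \Big/ \binom{k}{j} \le |T| \binom{r}{j} \Big/ \binom{k}{j}, \quad \text{since } |\tau \cap \sigma| \le r.$$

At least we need

$$|T| \binom{r}{j} \Big/ \binom{k}{j} < 1. \qquad (5.4)$$

Secondly, let $K = A_T^* A_T$; then

$$K_{ii} = 1$$

and since $\forall \sigma_i, \sigma_l \in T$, $|\sigma_i \cap \sigma_l| \le r$, we have

$$K_{il} \le \binom{r}{j} \Big/ \binom{k}{j}, \quad i \ne l.$$

Under condition (5.4), $K$ is diagonally dominant, i.e.,

$$\sum_{l \ne i} |K_{il}| < K_{ii}.$$

Then by the Gershgorin Circle Theorem,

$$\|(A_T^* A_T)^{-1}\|_2 \le \frac{1}{1 - (|T| - 1) \binom{r}{j} / \binom{k}{j}} \le \frac{1}{1 - |T| \binom{r}{j} / \binom{k}{j}}.$$

Therefore it suffices to have

$$\frac{|T| \sqrt{|T|} \binom{r}{j} / \binom{k}{j}}{1 - |T| \binom{r}{j} / \binom{k}{j}} < 1,$$

which gives

$$\binom{r}{j} \Big/ \binom{k}{j} < \frac{1}{|T| (1 + \sqrt{|T|})}.$$

To satisfy this, it suffices to assume

$$r \le \Big( \frac{1}{|T| (1 + \sqrt{|T|})} \Big)^{1/j} k.$$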
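To get a feel for how restrictive condition (5.3) is, the following Python sketch evaluates both the closed-form bound on $r$ and the intermediate binomial condition from the proof. The function names and the example parameters ($k = 20$, $j = 2$, $|T| = 4$) are hypothetical.

```python
from math import comb, floor, sqrt

def overlap_bound(k, j, t):
    """Right-hand side of (5.3): k * (1 / (t * (1 + sqrt(t)))) ** (1 / j)."""
    return k * (1.0 / (t * (1.0 + sqrt(t)))) ** (1.0 / j)

def binomial_condition(k, j, r, t):
    """Intermediate condition: C(r, j) / C(k, j) < 1 / (t * (1 + sqrt(t)))."""
    return comb(r, j) / comb(k, j) < 1.0 / (t * (1.0 + sqrt(t)))

k, j, t = 20, 2, 4
r_max = floor(overlap_bound(k, j, t))          # largest integer r allowed by (5.3)
holds_at_r_max = binomial_condition(k, j, r_max, t)
holds_at_7 = binomial_condition(k, j, 7, t)
```

In this example the closed-form bound allows overlaps up to 5, while the binomial condition is slightly weaker and fails only from overlap 7 onwards, which illustrates that (5.3) is a convenient but conservative simplification.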



D. Stable Recovery Theorems

In applications, one always encounters examples with noise such that exact sparse recovery is impossible. In this setting, $\mathcal{P}_{1,\delta}$ is a good replacement for $\mathcal{P}_1$ as a robust reconstruction program. Here we present a stable recovery theorem for $\mathcal{P}_{1,\delta}$ with bounded noise.

Theorem 5.9. Under the general framework (2.3), we assume that $\|z\|_\infty \le \epsilon$, $|T| = s$, and the IRR

$$\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty \le \alpha < 1/s.$$

Then the following error bound holds for any solution $x_\delta$ of $\mathcal{P}_{1,\delta}$:

$$\|x_\delta - x_0\|_1 \le \frac{2s(\epsilon + \delta)}{1 - \alpha s} \|A_T (A_T^* A_T)^{-1}\|_1. \qquad (5.5)$$

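Proof. Let $h = x_\delta - x_0$. Note that $\|A x_\delta - b\|_\infty \le \delta$ and $z = b - A x_0$ with $\|z\|_\infty \le \epsilon$. Then

$$\|Ah\|_\infty = \|A x_\delta - A x_0\|_\infty = \|A x_\delta - b + b - A x_0\|_\infty \le \|A x_\delta - b\|_\infty + \|z\|_\infty \le \delta + \epsilon. \qquad (5.6)$$

We denote by $x_\delta|_T$ the restriction of $x_\delta$ to the support $T$, i.e., all the entries of $x_\delta$ corresponding to $T^c$ are set to zero. From the optimization problem in ($\mathcal{P}_{1,\delta}$), we know that $\|x_0\|_1 \ge \|x_\delta\|_1$, so

$$\|h_T\|_1 = \|x_0 - x_\delta|_T\|_1 \ge \|x_0\|_1 - \|x_\delta|_T\|_1 \ge \|x_\delta\|_1 - \|x_\delta|_T\|_1 = \|x_\delta|_{T^c}\|_1 = \|h_{T^c}\|_1. \qquad (5.7)$$

Therefore,

$$|\langle Ah, A_T (A_T^* A_T)^{-1} h_T \rangle|$$
$$= |\langle A_T h_T, A_T (A_T^* A_T)^{-1} h_T \rangle + \langle A_{T^c} h_{T^c}, A_T (A_T^* A_T)^{-1} h_T \rangle|$$
$$\ge \|h_T\|_2^2 - |\langle h_{T^c}, A_{T^c}^* A_T (A_T^* A_T)^{-1} h_T \rangle|$$
$$\ge \|h_T\|_2^2 - \|h_{T^c}\|_1 \|A_{T^c}^* A_T (A_T^* A_T)^{-1} h_T\|_\infty$$
$$\ge \frac{1}{s} \|h_T\|_1^2 - \alpha \|h_{T^c}\|_1 \|h_T\|_\infty$$
$$\ge \frac{1}{s} \|h_T\|_1^2 - \alpha \|h_{T^c}\|_1 \|h_T\|_1$$
$$\ge \Big( \frac{1}{s} - \alpha \Big) \|h_T\|_1^2,$$

where the last step is due to $\|h_T\|_1 \ge \|h_{T^c}\|_1$ in inequality (5.7). On the other hand,

$$|\langle Ah, A_T (A_T^* A_T)^{-1} h_T \rangle| \le \|Ah\|_\infty \|A_T (A_T^* A_T)^{-1} h_T\|_1 \le (\delta + \epsilon) \|A_T (A_T^* A_T)^{-1}\|_1 \|h_T\|_1,$$

using (5.6). Combining these two inequalities yields

$$\|h_T\|_1 \le \frac{s(\epsilon + \delta)}{1 - \alpha s} \|A_T (A_T^* A_T)^{-1}\|_1.$$

Since $\|h\|_1 = \|h_T\|_1 + \|h_{T^c}\|_1 \le 2 \|h_T\|_1$ by (5.7), the bound (5.5) follows, as desired.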



In the special case where $k = j + 1$, we have:

Corollary 5.10. Let $k = j + 1$, $|T| = s$, and suppose that for any $\sigma_1, \sigma_2 \in T$, the two cliques corresponding to $\sigma_1$ and $\sigma_2$ have overlaps no larger than $r$, with $r \le j - 2$. Then we have $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty \le 1/(j+1)$, and thus the following error bound holds for any solution $x_\delta$ of $\mathcal{P}_{1,\delta}$:

$$\|x_\delta - x_0\|_1 \le \frac{2s(\epsilon + \delta)}{1 - \frac{s}{j+1}} \sqrt{j+1}, \qquad s < j + 1.$$
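The error bounds of Theorem 5.9 and Corollary 5.10 are easy to evaluate numerically. The sketch below is a direct transcription of the two formulas; the function names and the example values ($s = 2$, $j = 3$, $\epsilon = \delta = 0.1$) are hypothetical.

```python
from math import isclose, sqrt

def stable_recovery_bound(s, eps, delta, alpha, norm_term):
    """Bound (5.5): 2 s (eps + delta) / (1 - alpha s) * ||A_T (A_T^* A_T)^{-1}||_1."""
    if alpha * s >= 1:
        raise ValueError("the bound requires alpha < 1 / s")
    return 2 * s * (eps + delta) / (1 - alpha * s) * norm_term

def corollary_bound(s, j, eps, delta):
    """Special case k = j + 1: alpha = 1 / (j + 1) and norm term sqrt(j + 1)."""
    return stable_recovery_bound(s, eps, delta, 1.0 / (j + 1), sqrt(j + 1))

bound = corollary_bound(s=2, j=3, eps=0.1, delta=0.1)
```

The bound grows quickly as $s$ approaches $j + 1$, where the denominator $1 - s/(j+1)$ vanishes.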

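Proof. This corollary follows from the lemma above. Note that when the conditions in Theorem 5.5 hold, $A_T^* A_T = I$ and $\|A_T\|_1 \le \frac{j+1}{\sqrt{j+1}} = \sqrt{j+1}$.

Now it suffices to establish that in this special case we have

$$\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty \le \frac{1}{j+1} < 1.$$

Note that since any $\sigma_1, \sigma_2 \in T$ satisfy $|\sigma_1 \cap \sigma_2| \le j - 2$, $A_T^* A_T$ is an identity matrix.

So $\|A_{T^c}^* A_T (A_T^* A_T)^{-1}\|_\infty = \|A_{T^c}^* A_T\|_\infty$. Now for $\tau \in T^c$, let $S_\tau = \{\sigma : |\sigma \cap \tau| \ge j, \sigma \in T\}$; then $|S_\tau| \le 1$. Otherwise, suppose $\{\sigma_1, \sigma_2\} \subset S_\tau$, so that $|S_\tau| \ge 2$; then we have

$$|\tau| \ge |\tau \cap (\sigma_1 \cup \sigma_2)| = |\tau \cap \sigma_1| + |\tau \cap \sigma_2| - |\tau \cap \sigma_1 \cap \sigma_2| \ge j + j - (j - 2) = j + 2,$$

which contradicts the fact that $\tau$ is a $(j+1)$-set. So there exists at most one $\sigma_0 \in T$ such that $|\tau \cap \sigma_0| \ge j$. Let $v_\tau$ be the row vector of $A_{T^c}^* A_T$ with row index corresponding to $\tau$. Then $v_\tau$ has at most one nonzero entry, and the sum of its absolute entries is at most

$$\binom{j}{j} \Big/ \binom{j+1}{j} = \frac{1}{j+1} < 1.$$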



E. Identifying Cliques with Mixed Sizes

In general settings, we need to identify high order cliques of mixed sizes, i.e., cliques of sizes $k_1, k_2, \ldots, k_l$ ($k_1 < k_2 < \cdots < k_l$), based on the observed data $b$ on all $j$-sets. One way to construct the basis matrix $A$ is by concatenating $R^{j,k}$ with different $k$'s satisfying $k > j$. We can then solve $\mathcal{P}_1$ and $\mathcal{P}_{1,\delta}$ for exact recovery and stable recovery with this newly concatenated basis matrix $A$. We have the following theorem:

Theorem 5.11. Suppose $x_0$ is a sparse signal on cliques of sizes $k_1, k_2, \ldots, k_l$ ($j < k_1 < k_2 < \cdots < k_l \le k$) and $b = A x_0$. Let $T = \mathrm{supp}(x_0)$.

1. If the cliques in $T$ have no overlaps, then they can be identified by $\mathcal{P}_1$.

2. Moreover, if the data $b = A x_0 + z$ is contaminated by the noise $z$, $\mathcal{P}_{1,\delta}$ provides an estimate of $x_0$ for which the stable recovery inequality still holds.

Proof. We prove that if any $\sigma_1, \sigma_2 \in T$ satisfy $|\sigma_1 \cap \sigma_2| = 0$, then solving $\mathcal{P}_1$ exactly identifies $x_0$.

For simplicity, given any $\tau \in T^c$, we define



Note that the fact that any $\sigma_1$ and $\sigma_2$ in $T$ have empty intersection implies that $A_T^* A_T = I$; moreover, given $\tau \in T^c$, the sets $\{\tau \cap \sigma \mid \sigma \in T\}$ are disjoint. Note that if there is only one $\sigma_0$ satisfying $|\tau \cap \sigma_0| \ge j$, then



because it is the inner product of two column vectors of $A$, corresponding to $\tau$ and $\sigma_0$, and no two columns of $A$ are identical.

Now suppose there are at least two $\sigma$'s satisfying $|\tau \cap \sigma| \ge j$; then we have



Since the sets $\{\tau \cap \sigma \mid \sigma \in T\}$ are disjoint, if we can prove



then we know that




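Now we only need to prove the following inequality: for $j \ge 2$ and $n_1 \ge j$, $n_2 \ge j$,

$$\sqrt{\binom{n_1}{j}} + \sqrt{\binom{n_2}{j}} \le \sqrt{\binom{n_1 + n_2}{j}}.$$

The case $j = 2$ can be verified directly. For $j \ge 3$, we square both sides, so that we only need to prove $\binom{n_1}{j} + \binom{n_2}{j} + 2 \sqrt{\binom{n_1}{j} \binom{n_2}{j}} \le \binom{n_1 + n_2}{j}$. Since

$$\binom{n_1 + n_2}{j} = \sum_{i=0}^{j} \binom{n_1}{j-i} \binom{n_2}{i},$$

and the terms with $i \in \{0, 1, j-1, j\}$ are distinct when $j \ge 3$, we only need to prove $2 \sqrt{\binom{n_1}{j} \binom{n_2}{j}} \le n_2 \binom{n_1}{j-1} + n_1 \binom{n_2}{j-1}$. Since $n_2 \binom{n_1}{j-1} + n_1 \binom{n_2}{j-1} \ge 2 \sqrt{n_1 n_2 \binom{n_1}{j-1} \binom{n_2}{j-1}}$, we only need to verify $n_1 \binom{n_1}{j-1} \ge \binom{n_1}{j}$, which can be easily verified by writing out both sides explicitly. □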
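The binomial inequality at the heart of this proof can be spot-checked numerically. The sketch below avoids floating-point square roots by using the equivalent integer test obtained from squaring twice; the function name and the tested parameter range are arbitrary choices.

```python
from math import comb

def sqrt_binomial_inequality_holds(n1, n2, j):
    """Exact test of sqrt(C(n1,j)) + sqrt(C(n2,j)) <= sqrt(C(n1+n2,j)).

    With a = C(n1,j), b = C(n2,j), c = C(n1+n2,j), the inequality is equivalent
    to c - a - b >= 0 and 4ab <= (c - a - b)^2.
    """
    a, b, c = comb(n1, j), comb(n2, j), comb(n1 + n2, j)
    rest = c - a - b
    return rest >= 0 and 4 * a * b <= rest * rest

violations = [
    (n1, n2, j)
    for j in range(2, 7)
    for n1 in range(j, 15)
    for n2 in range(j, 15)
    if not sqrt_binomial_inequality_holds(n1, n2, j)
]
```

No violations are found in this range, which is consistent with the proof. This is only a finite sanity check and does not replace the argument above.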

The above theorem provides us a sufficient condition to guarantee exact sparse recovery with 
concatenated bases and the stable recovery theory is also established. 

VI. A Polynomial Time Approximation Algorithm 

In practical applications, we have pairwise interaction data in a network with $n$ nodes and we wish to infer high order cliques up to size $k$. Directly constructing $A$ by concatenating the Radon basis matrices for all clique sizes up to $k$ and solving $\mathcal{P}_{1,\delta}$ would incur exponential complexity, since $A$ has exponentially many columns with respect to $k$. This would be intractable for inferring high order cliques in large networks. In this section, we describe a polynomial time (with respect to both $n$ and $k$) approximation algorithm for solving $\mathcal{P}_{1,\delta}$. Recall that the primal and dual programs $\mathcal{P}_{1,\delta}$ and $\mathcal{D}_{1,\delta}$ are:

$$(\mathcal{P}_{1,\delta}) \quad \min \|x\|_1 \quad \text{s.t.} \quad \|Ax - b\|_\infty \le \delta$$

$$(\mathcal{D}_{1,\delta}) \quad \max \; -\delta \|\gamma\|_1 - b^* \gamma \quad \text{s.t.} \quad \|A^* \gamma\|_\infty \le 1.$$



Proposition 6.1. The problem ($\mathcal{P}_{1,\delta}$) is the dual of ($\mathcal{D}_{1,\delta}$).

Proof. Consider an alternative form of $\mathcal{P}_{1,\delta}$,

$$\min_{x, \xi} \; 1^* \xi \quad \text{s.t.} \quad -\delta 1 \le Ax - b \le \delta 1, \quad -\xi \le x \le \xi, \quad \xi \ge 0,$$

whose Lagrangian is

$$L(x, \xi; \gamma, \lambda, \mu) = 1^* \xi - \gamma_+^* (\delta 1 - Ax + b) - \gamma_-^* (Ax - b + \delta 1) - \lambda_+^* (\xi - x) - \lambda_-^* (\xi + x) - \mu^* \xi.$$

Here, if we assume $A$ is a matrix of size $M$ by $N$, then $\gamma_+ = (\gamma_+(1), \ldots, \gamma_+(M))^* \in \mathbb{R}_+^M$, $\gamma_- = (\gamma_-(1), \ldots, \gamma_-(M))^* \in \mathbb{R}_+^M$, $\lambda_+ = (\lambda_+(1), \ldots, \lambda_+(N))^* \in \mathbb{R}_+^N$, $\lambda_- = (\lambda_-(1), \ldots, \lambda_-(N))^* \in \mathbb{R}_+^N$, and $\mu \in \mathbb{R}_+^N$ are the Lagrange multipliers. Then the KKT condition gives

1. $A^* (\gamma_+ - \gamma_-) + (\lambda_+ - \lambda_-) = 0$,

2. $1 - (\lambda_+ + \lambda_-) - \mu = 0$,

with $\gamma, \lambda, \mu \ge 0$ and $\gamma_+(t) \gamma_-(t) = \lambda_+(\tau) \lambda_-(\tau) = 0$ for all $t$, $\tau$. Now we can see that the dual function of $\mathcal{P}_{1,\delta}$ is

$$-\delta (\gamma_+ + \gamma_-) \cdot 1 - (\gamma_+ - \gamma_-)^* b,$$

which is $-\delta \|\gamma\|_1 - b^* \gamma$ with $\gamma = \gamma_+ - \gamma_-$, while the constraint for $\gamma$ is $\|A^* \gamma\|_\infty \le 1$. □

The key of our algorithm is that we use a polynomial number of variables and constraints to approximate both programs, yielding an approximate solution for $\mathcal{P}_{1,\delta}$. More precisely, we apply a sequential primal-dual interior point method to solve the relaxed programs:

$$(\mathcal{P}_{1,\delta,T}) \quad \min \|x\|_1 \quad \text{s.t.} \quad \|A_T x - b\|_\infty \le \delta$$

$$(\mathcal{D}_{1,\delta,T}) \quad \max \; -\delta \|\gamma\|_1 - b^* \gamma \quad \text{s.t.} \quad \|A_T^* \gamma\|_\infty \le 1.$$

Here $A_T$ is a submatrix of $A$ where we extract a subset of columns $T$. We approximate the solution to the original programs by solving the above relaxed programs, where we only use polynomially many columns indexed by $T$. In particular, we want to find an interior point $\gamma$ for $\mathcal{D}_{1,\delta,T}$ which is also feasible for $\mathcal{D}_{1,\delta}$. With this $\gamma$ available, we can use duality gaps to check convergence, because the current dual objective provides a lower bound for $\mathcal{D}_{1,\delta}$ and any interior point for $\mathcal{P}_{1,\delta,T}$ provides an upper bound for $\mathcal{P}_{1,\delta}$.

Let $A_i$ be the $i$-th column of $A$. We need to sequentially update the column set $T$. When we have a solution $\gamma$ (which is called the approximate analytic center) for the relaxed program $\mathcal{D}_{1,\delta,T}$, we need to find a new column $A_i$ ($i \in T^c$) whose constraint is violated, i.e., for which $\gamma$ is not feasible in $\mathcal{D}_{1,\delta}$. By incorporating $A_i$ into $T$, the feasible region of $\mathcal{D}_{1,\delta,T}$ is reduced to better approximate that of $\mathcal{D}_{1,\delta}$. When the current solution $\gamma$ has no violated constraint, i.e., $\gamma$ is feasible for $\mathcal{D}_{1,\delta}$, we use interior point methods to find a series of interior points which converge to the solution of $\mathcal{D}_{1,\delta,T}$. However, we may obtain a new interior point $\gamma$ which is not feasible for $\mathcal{D}_{1,\delta}$. We then go back and add violated constraints. A formal description is provided in Algorithm 1.



Algorithm 1 Cutting Plane Method for Solving $\mathcal{P}_{1,\delta}$
Initialize $A_T = I$, $x = b$, $\gamma = (1, 1, \ldots, 1)^*$.
while TRUE do
    if $\exists \, |A_i^* \gamma| > 1$ where $i \in T^c$ then
        $T \leftarrow T \cup \{i\}$, formulate new $\mathcal{D}_{1,\delta,T}$ and $\mathcal{P}_{1,\delta,T}$.
        Find new interior points $\gamma$ and $x$ for $\mathcal{D}_{1,\delta,T}$ and $\mathcal{P}_{1,\delta,T}$ respectively.
    else if the duality gap is small then
        Get the dual solution $x$ and stop.
    else
        Find a new interior point $\gamma$ for $\mathcal{D}_{1,\delta,T}$, which optimizes the dual objective.
    end if
end while



In Algorithm 1 the first IF statement involves the problem of finding a violated dual constraint for the current relaxed program. In the special case where $\gamma$ are dual variables associated with edges, the problem becomes the maximum edge weight clique problem, which is known to be NP-hard. To solve this problem, we use a simple greedy heuristic algorithm, which iteratively adds new nodes in order to maximize the summation of edge weights (Lueker, 1978); it runs in $O(nk^2)$ time and can return a 0.94-approximate solution in the average case. Note that, if $\gamma$ is feasible for the dual relaxation problem with no additional violated constraints, then $0.94\gamma$ must be feasible for $\mathcal{D}_{1,\delta}$, whose objective is discounted by 0.94. Thus, we will terminate with a 0.94-approximate solution.
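A greedy node-addition heuristic of the kind described above can be sketched as follows. This is an illustrative version and not necessarily the exact procedure of Lueker (1978) or of the authors' implementation: the seeding rule (start from the heaviest edge), the function names, and the toy weights are assumptions, and the violation test assumes unit-norm Radon columns so that $A_\sigma^* \gamma$ is the edge-weight sum inside $\sigma$ divided by $\sqrt{\binom{k}{2}}$.

```python
from itertools import combinations
from math import comb, sqrt

def greedy_heavy_clique(weights, n, k):
    """Grow a k-node set by repeatedly adding the node with the largest weight gain.

    weights maps frozenset({u, v}) to the dual variable on edge (u, v).
    """
    def w(u, v):
        return weights.get(frozenset((u, v)), 0.0)

    seed = max(combinations(range(n), 2), key=lambda e: w(*e))
    clique, total = list(seed), w(*seed)
    while len(clique) < k:
        gain, node = max(
            (sum(w(u, v) for v in clique), u)
            for u in range(n) if u not in clique
        )
        clique.append(node)
        total += gain
    return sorted(clique), total

def violates_dual_constraint(total, k, j=2):
    """Check |A_sigma^* gamma| > 1 for the k-set whose inner edge weights sum to total."""
    return abs(total) / sqrt(comb(k, j)) > 1.0

# Toy dual variables: a heavy triangle {0, 1, 2} on a lightly weighted 5-node graph.
n, k = 5, 3
weights = {frozenset(e): 0.1 for e in combinations(range(n), 2)}
for e in combinations((0, 1, 2), 2):
    weights[frozenset(e)] = 1.0
clique, total = greedy_heavy_clique(weights, n, k)
violated = violates_dual_constraint(total, k)
```

After the quadratic-time seeding step, the growth loop performs about $k$ rounds, each scanning $n$ candidates against at most $k$ current members. Since the heuristic maximizes the signed sum, a search for large negative sums would run it on the negated weights.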

Let $\eta$ be the threshold to check the duality gap. Algorithm 1 can also be understood as the column generation method (Dantzig and Wolfe, 1960), since adding a new inequality constraint to the dual program corresponds to adding a new column to the primal program; see Mitchell (2003) and Ye (1997). Theoretically, if one is able to find a violated constraint in constant time and uses interior point methods to locate approximate centers of the primal-dual feasible regions, then Algorithm 1 has computational complexity $O(M/\eta^2)$, where $M$ is the number of dual variables (Mitchell, 2003; Ye, 1997). In our case, $M = O(n^2)$ and finding a violated constraint has complexity $O(nk^2)$, thus Algorithm 1 has complexity $O(n^3 k^2 / \eta^2)$.

Finally, we note that other iterative algorithms, e.g., Bregman iterations, which have guaranteed convergence rates (Cai et al., 2009), can be used to find solutions of the linear program relaxations in our algorithms. We also note that, in practice, we never need to explicitly construct the matrix $A$, because there are many combinatorial structures within the basis matrix to exploit. For example, inner products between the bases can be evaluated efficiently by directly comparing two sets.



VII. Application Examples 



In this section, we provide four application examples to illustrate the effectiveness of the proposed framework. As we will see, our clique-based model can deal with overlaps between cliques, which gives us more community structural information than purely clustering-based methods and the state-of-the-art clique percolation method. In these examples, we use the clique volume and conductance, which arguably are the simplest evaluation criteria of clustering quality, to evaluate different algorithms. The clique volume is the sum of edge weights inside the clique, while the clique conductance is the ratio between the sum of edge weights leaving the clique and the clique volume (Leskovec et al., 2010).

More precisely, let $B_{uv}$ be the element on the $u$-th row and $v$-th column of the adjacency matrix $B$. The conductance $\phi(S)$ of a set of nodes $S$ is defined as

$$\phi(S) = \frac{\sum_{u \in S, v \notin S} B_{uv}}{\min(\mathrm{Vol}(S), \mathrm{Vol}(V \setminus S))}$$

and the volume is $\mathrm{Vol}(S) = \sum_{u, v \in S} B_{uv}$.

A. Basketball Team Detection 



Detecting two basketball teams from pairwise interactions among players is an ideal scenario, since the two teams do not overlap. Suppose $x_0$ is the true signal indicating the two teams among all 5-sets of the 10-player set, i.e., it is sparsely concentrated on the two 5-sets which correspond to the two teams, with magnitudes both equal to one. Assume we have observations $b$ of pairwise interactions, i.e., $b = A x_0 + z$, where $z$ is bounded random noise uniformly distributed in $[-\epsilon, \epsilon]$. We solve $\mathcal{P}_{1,\delta}$ with $\delta = \epsilon$, which is a linear programming search over $x \in \mathbb{R}^{\binom{10}{5}} = \mathbb{R}^{252}$ with a parameter matrix $A \in \mathbb{R}^{\binom{10}{2} \times \binom{10}{5}} = \mathbb{R}^{45 \times 252}$ and $b \in \mathbb{R}^{45}$. The results are shown in Figure 2. In Figure 2-(a), we see that the two basketball teams are perfectly detected, as expected. Since the two 5-sets corresponding to the two teams have no overlap, they satisfy the Irrepresentable Condition (IRR). In Figure 2-(b), we try to detect the two teams under different noise levels $\epsilon \in [0, 1]$. The two basketball teams can be detected under fairly large noise levels. This example can also be handled using spectral clustering techniques, where we normalize the pairwise interaction data to get the transition matrix, followed by spectral clustering on eigenspaces. We observed that both our method and spectral clustering work very well under noise levels less than 0.8 (i.e., $|\epsilon| < 0.8$).

Figure 2: Detecting Basketball Teams with Noise. (a) Two teams in a virtual basketball game, with intra-team interaction 1 and cross-team interaction noise no more than $\epsilon$; (b) Under a large noise level $\epsilon \le 0.9$, the two teams are identifiable. For each noise level, we run 100 simulations repeatedly, and the errorbar plot of weights on cliques is shown against the noise level $\epsilon$.
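The problem dimensions and the IRR claim for this example can be checked directly. The sketch below labels the players 0 to 9 and assumes the teams are {0, ..., 4} and {5, ..., 9}; both the labeling and the variable names are hypothetical. Because the teams are disjoint, $A_T^* A_T$ is the identity, so the IRR quantity reduces to the largest $\mu_\tau$ over the remaining 5-sets.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

players, j, k = 10, 2, 5
pairs = list(combinations(range(players), j))       # rows of A (pairwise data)
five_sets = list(combinations(range(players), k))   # columns of A (candidate teams)
teams = [(0, 1, 2, 3, 4), (5, 6, 7, 8, 9)]          # support T of x_0

def mu(tau):
    """Row sum of A_{T^c}^* A_T for the 5-set tau, computed from intersections."""
    return sum(
        Fraction(comb(len(set(tau) & set(team)), j), comb(k, j))
        for team in teams
    )

worst = max(mu(tau) for tau in five_sets if tau not in teams)
```

The worst case is a 5-set taking four players from one team and one from the other, which gives $\binom{4}{2}/\binom{5}{2} = 3/5 < 1$, so IRR holds with a comfortable margin in this noise-free setting.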



B. The Social Network of Les Miserables 



We consider the social network of 33 characters in Victor Hugo's novel Les Miserables (Knuth, 1993). We represent this social network using a weighted graph (Figure 3-(a)). The edge weights are the co-appearance frequencies of the two corresponding characters. Table 1 illustrates several social communities formed by relationships including friendships, street gangs, kinships, etc. The underlying social community structure, regarded as the ground truth for the data, is summarized in Figure 3-(a), where several social communities arise. Figure 3-(b) shows the spectral clustering result, in which the first three red cuts are reasonable while the next three blue cuts destroy many community structures within the network.

Figure 3: Decomposition of Les Miserables social network. (a) Social network of characters in Les Miserables; (b) Spectral clustering result; (c) The identified 3-cliques; (d) The identified 4-cliques.

We compare our method with the clique percolation method: 23 and 19 cliques were identified respectively, and our approach identifies more meaningful cliques (see Figure 3 and Table 1, where we verified the ground truth from the novel). For example, our method can correctly identify two separate cliques {4, 15, 22} and {20, 21, 22}, while the clique percolation method treats {4, 15, 20, 21, 22} as a single clique. The interaction frequencies among those characters, however, show that there are relatively small cross-community interactions, thus those two 3-cliques should be separated. Figures 3-(c) and 3-(d) depict important 3-cliques and 4-cliques identified by our algorithm. The sparsity patterns of those cliques satisfy the irrepresentable condition, with overlaps between them generally not large. However, they do not necessarily satisfy the condition in Lemma 5.6, which is based on a worst-case analysis. In Figure 4 we also compare both methods in terms of clique conductances and volumes and see that the cliques identified by Radon basis pursuit have slightly lower conductances and larger volumes, which demonstrates the advantages of our approach.

Figure 4: Les Miserables social network: Box plot of clique conductances and volumes for clique percolation method and our approach. Cliques identified by our approach have smaller conductances and larger volumes.



Table 1: Social Networks of Les Miserables

Cliques         Names of Characters                             Relationships        Perco.   Radon
--------------  ----------------------------------------------  -------------------  -------  -----
{1,2,3}         {Myriel, Mlle Baptistine, Mme Magloire}         Friendship           N        N
{4,13,14}       {Valjean, Mme Thenardier, Thenardier}           Dramatic Conflicts   N        Y
{4,15,22}       {Valjean, Cosette, Marius}                      Dramatic Conflicts   N        Y
{20,21,22}      {Gillenormand, Mlle Gillenormand, Marius}       Kinship              N        Y
{5,6,7,8}       {Tholomyes, Listolier, Fameuil, Blacheville}    Friendship           Y        Y
{9,10,11,12}    {Favourite, Dahlia, Zephine, Fantine}           Friendship           Y        Y
{14,31,32,33}   {Thenardier, Gueulemer, Babet, Claquesous}      Street Gang          N        Y

In summary, our method obtains richer social structure information than the competing techniques. We also obtain social communities with overlaps, which is impossible for clustering methods. We note that some simple schemes do not work well. For example, one may think of scoring each large clique by the mean score of the small cliques it contains. In this example, since two or three key characters appear very frequently, the top high order cliques would always contain them. In fact, among the top ten 3-cliques, seven contain node 4 and six contain node 15, which does not give good results.

C. Coauthorships in Network Science 

We also studied a medium-size coauthorship network with a total of 1,589 scientists from a broad variety of fields. Part of this network is shown in Figure 5-(a). Our approach and the clique percolation method identify 136 and 166 cliques respectively. We also compare the two methods in terms of clique conductances and volumes. From Figure 6-(a),(b), we see that the cliques identified by Radon basis pursuit have smaller conductances than, and clique volumes comparable to, those of the clique percolation method. Our approach scales well: in this example, it identifies the cliques up to size 9 in 564 seconds. This application example shows that our approach can be used to identify cliques in social networks with hundreds or even thousands of nodes.

Finally, we note that clustering techniques, e.g., spectral clustering, combined with our algo- 
rithm can provide a more refined analysis of the network. We can look at the persistence of 
identified cliques in the binary tree decomposition of bipartite spectral clustering of the net- 
work in a bottom-up way. Cliques which persist through more levels will give us meaningful 
community structural information. 

In figure [5]- (b), a small fraction of the binary tree decomposition of bipartite spectral clus- 
tering is depicted, where child nodes are spectral bipartition of the parent node. We can 
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(a) (b) 

Figure 5: (a) Coauthorships in Network Science, only a part of the network is shown; (b) 
Important chques identified within clusters behave in a persistent way. Clustering node B is 
exactly the blue part in (a) 

detect cliques within the child nodes. Once cliques within clusters C, D are identified, we 
then backtrack to the parent nodes B and A to see if the identified cliques still persist. 

We can identify 3 cliques ($c_1$={Kumar, Raghavan, Rajagopalan, Tomkins}, $c_2$={Kumar. S, 
Raghavan, Rajagopalan}, $c_3$={Raghavan, Rajagopalan, Tomkins, Kumar. S}) within C and 
3 cliques ($d_1$={Flake. G, Lawrence. S, Giles. C, Coetzee. F}, $d_2$={Flake. G, Lawrence. 
S, Giles. C, Pennock. D, Glover. E}, $d_3$={Flake. G, Lawrence. S, Giles. C}) within 
D which persist to parents B and A. We can identify papers whose authors are exactly 
those cliques. Clustering alone would not produce this result because those cliques overlap 
heavily. 

In Figure 5-(b), for simplicity, we only show two persistent cliques, $c_1$={Kumar, Raghavan, 
Rajagopalan, Tomkins} and $d_1$={Flake. G, Lawrence. S, Giles. C, Coetzee. F}, which are 
the most important cliques (having the largest weights when solving the LP) in 
clusters C and D respectively. These two cliques are also the two most important cliques in 
cluster B, and if we backtrack them even further to cluster A, they are still ranked 
first and third in terms of weights among all cliques identifiable in A. 

Figure 6: Coauthorship Network: Box plot of clique conductances and volumes for the clique 
percolation method and our approach. Cliques identified by our approach have smaller 
conductances and larger volumes. 



D. Inferring high order ranking 



The Jester dataset (Goldberg et al., 2001) contains about 24,000 users who give ratings on 100 
jokes. The ratings are real-valued, ranging from -10.00 to +10.00. We extract the top 20 
jokes from the entire dataset according to mean scores. Among those 20 jokes, we count 
each user's vote for a top-5 joke set and view these counts as the ground truth. Figure 7-(a) shows 
that there is a top 5-set, {27, 29, 35, 36, 50}, with overwhelmingly more votes than the others. 

Now suppose we only know the top-3 counts of the jokes and ask whether we can 
identify the most popular 5-joke group. By solving $\mathcal{P}_{1,\delta}$ along the whole 
regularization path obtained by varying $\delta$, we are able to detect this subset robustly 
(Figure 7-(b)). 
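
To make the observed data in this experiment concrete, the sketch below computes counts on 3-subsets from votes on 5-subsets, assuming (as in the Radon basis construction) that the count of a 3-subset is the total number of votes for all 5-sets containing it. Only this forward map is shown; the recovery step that solves the linear program is not. The function name and the toy vote counts are illustrative, not taken from the Jester data.

```python
from collections import Counter
from itertools import combinations


def radon_marginals(votes, j):
    """Map vote counts on k-subsets to counts on j-subsets (j <= k).

    votes: dict mapping frozenset k-subsets to vote counts
    Returns a Counter keyed by sorted j-tuples; each entry sums the votes
    of every k-subset that contains that j-subset.
    """
    marginals = Counter()
    for kset, count in votes.items():
        for sub in combinations(sorted(kset), j):
            marginals[sub] += count
    return marginals


# Toy votes on two 5-subsets of item IDs (hypothetical numbers).
votes = {
    frozenset({1, 2, 3, 4, 5}): 10,
    frozenset({1, 2, 3, 6, 7}): 2,
}
b = radon_marginals(votes, 3)
# The triple (1, 2, 3) lies in both 5-sets, so its count sums both vote totals.
```

The inverse problem studied in this section is then to recover a sparse vote vector on 5-subsets from `b` alone.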



(a) Distribution of votes on top 5 subsets (b) Solution path on inferring top 5 jokes 

Figure 7: (a) There is a significant top-5 joke set (in red) whose IDs are {27,29,35,36,50}; (b) 
Regularization path where the top curve (red) selects this top group over $\delta \in [50, 130]$. Note 
that the second curve from the top (green) also identifies the fourth 5-set in a persistent way. 



VIII. Conclusions 

In this work, we present a novel approach that connects two seemingly different areas: network 
data analysis and compressive sensing. By adopting a new algebraic tool, Radon basis 
pursuit in homogeneous spaces, we formulate the network clique detection problem as 
a compressed sensing problem. This formulation allows us to construct rigorous 
conditions that characterize the network clique recovery problems. Instead of providing another 
heuristic method, we aim to contribute at the foundational level to network data analysis. 
We hope that our work can build a bridge connecting the research communities of network 
modeling and compressive sensing, so that research results and tools from one area can be 
ported to the other to create more exciting results. 

To illustrate the usefulness of this new framework, we present a novel approach to identifying 
overlapping communities as cliques in social networks, based on compressed sensing with a 
new algebraic method, i.e., Radon basis pursuit in homogeneous spaces associated with 
permutation groups. Our approach starts from a general problem of compressive representation 
of low order interactive information from high order cliques, which first arose in 
identity management and statistical ranking, etc. Applied specifically to social networks, this 
approach studies bivariate functions defined on pairs of nodes, and looks for compressive 
representations of such functions based on clique information in networks. It turns out that 
the sparse representation under the Radon basis may disclose community structures, typically 
overlapping, in social networks. We have shown that noiseless exact recovery and stable 
recovery with uniformly bounded noise hold under some natural conditions. Though this paper 
is mainly methodological and theoretical, we also develop a polynomial-time approximation 
algorithm for solving empirical problems and demonstrate the usefulness of the proposed 
approach on real-world networks. 